
Wave 2: PanelProfile outcome/dose shape extensions + autonomous-guide worked examples (#366)

Merged
igerber merged 20 commits into main from agent-profile-shape-extensions
Apr 25, 2026
Conversation

@igerber
Owner

@igerber igerber commented Apr 24, 2026

Summary

  • Extend PanelProfile with outcome_shape: Optional[OutcomeShape] (numeric outcomes) and treatment_dose: Optional[TreatmentDoseShape] (continuous treatments). New fields populate distributional facts that gate WooldridgeDiD (QMLE) and ContinuousDiD prerequisites pre-fit.
  • Add three end-to-end worked examples to llms-autonomous.txt (binary-staggered with never-treated units, continuous-dose with zero baseline, count-shaped outcome). The new §5 sits between the existing §4 (reasoning) and the former §5 (post-fit validation); the existing §5–§8 are renumbered §6–§9.
  • Top-level exports: OutcomeShape, TreatmentDoseShape from diff_diff. Both new dataclasses are frozen and JSON-serializable via PanelProfile.to_dict().
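The two new dataclasses can be sketched roughly as follows. The field names here are assembled from the discussion in this thread, not copied from the diff_diff source, so treat this as an illustrative shape rather than the shipped definition:

```python
# Hypothetical sketch of the two frozen, JSON-serializable shape dataclasses.
# Field names are inferred from this PR discussion, not the real diff_diff API.
from dataclasses import dataclass, asdict
from typing import Optional
import json

@dataclass(frozen=True)
class OutcomeShape:
    is_integer_valued: bool
    is_count_like: Optional[bool]  # None when the heuristic is inapplicable
    pct_zeros: float
    skewness: float

@dataclass(frozen=True)
class TreatmentDoseShape:
    n_distinct_doses: int
    has_zero_dose: bool
    dose_min: float
    dose_max: float
    dose_mean: float

shape = OutcomeShape(True, True, 0.35, 1.8)
payload = json.dumps(asdict(shape))  # serializable, as PanelProfile.to_dict() requires
print(payload)
```

Freezing the dataclasses means profile consumers cannot mutate distributional facts after profiling, which is what makes the JSON roundtrip test meaningful.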

Methodology references (required if estimator / math changes)

  • Method name(s): profile_panel is descriptive (not an estimator); the worked examples reference CallawaySantAnna, WooldridgeDiD, ContinuousDiD. No estimator math is changed in this PR.
  • Paper / source link(s): no new methodology claims beyond what is already in docs/methodology/REGISTRY.md.
  • Any intentional deviations from the source: None. The is_time_invariant field uses exact distinct-count on observed non-zero doses (matching the documented contract and ContinuousDiD.fit()'s exact equality check). The is_count_like and is_integer_valued fields are explicitly labeled heuristics in the docstring + guide.

Validation

  • Tests added/updated: 17 new unit tests in tests/test_profile_panel.py covering each shape heuristic (count-like Poisson, binary-as-not-count-like, continuous normal, bounded unit, categorical returning None, skewness/kurtosis gating, JSON roundtrip, time-invariant dose, time-varying dose, no-zero-dose, binary/categorical treatment returning None, frozen invariants on both new dataclasses, sub-1e-8 dose-precision regression). 2 new content-stability tests in tests/test_guides.py guard the §5 worked-examples block and the new field references.
  • Backtest / simulation / notebook evidence: N/A (descriptive utility + documentation).

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

@github-actions

Overall assessment

⚠️ Needs changes — one unmitigated P1 methodology/contract mismatch in the new ContinuousDiD pre-fit gating surface.

Executive summary

  • P1: The new TreatmentDoseShape fields are documented as the ContinuousDiD prerequisites, but they do not match what ContinuousDiD.fit() actually enforces. That can mark ineligible panels as eligible pre-fit.
  • The mismatch is internal as well as methodological: the existing autonomous guide still says ContinuousDiD requires has_never_treated and treatment_varies_within_unit == False, while the new section shifts the gate to row-level has_zero_dose and nonzero-only is_time_invariant.
  • P2: The new worked example uses WooldridgeDiD(family="poisson"), but the public API is WooldridgeDiD(method="poisson"). The new guide test hard-codes the wrong call.
  • The new tests do not cover the two cases that would have exposed the P1: 0,0,d,d within-unit dose paths, and row-level zeros without any never-treated units.
  • I did not find a mitigating entry for the P1 in TODO.md. This was a static source review; I could not execute the code in this environment.

Methodology

  • Severity: P1
    Impact: ContinuousDiD is still implemented as requiring unit-level never-treated controls (first_treat == 0) and full per-unit dose constancy, but the new profile surface defines has_zero_dose as “any zero row” and is_time_invariant on nonzero doses only, then states those new fields are the prerequisites. That is an undocumented mismatch between the shipped pre-fit guidance and the estimator contract. A panel with pre-period zeros and no never-treated units, or a 0,0,d,d dose path, can satisfy the new guide’s gate while ContinuousDiD.fit() still raises.
    Concrete fix: Either align _compute_treatment_dose() to the actual estimator contract, or keep the fields descriptive and stop claiming they alone gate ContinuousDiD. In either case, keep has_never_treated and full-path time-invariance as the authoritative prerequisites unless you also change ContinuousDiD.fit() and document that deviation in REGISTRY.md.
    Locations: diff_diff/profile.py:L656-L710, diff_diff/continuous_did.py:L222-L227, diff_diff/continuous_did.py:L348-L360, diff_diff/guides/llms-autonomous.txt:L115-L140, diff_diff/guides/llms-autonomous.txt:L198-L211, diff_diff/guides/llms-autonomous.txt:L690-L723, docs/methodology/REGISTRY.md:L719-L745

Code Quality

  • No findings in the changed Python implementation beyond the methodology contract issue above.

Performance

  • No findings.

Maintainability

  • No separate findings.

Tech Debt

  • No separate findings.
  • The P1 above is not mitigated by TODO.md; I did not find an existing tracked entry for it.

Security

  • No findings.

Documentation/Tests

  • Severity: P2
    Impact: The new autonomous guide tells users/agents to call WooldridgeDiD(family="poisson"), but the estimator constructor accepts method, not family. Anyone following the example will get a runtime error, and the new guide test now locks that bad example in place.
    Concrete fix: Change the guide and the stability test to WooldridgeDiD(method="poisson").
    Locations: diff_diff/guides/llms-autonomous.txt:L767-L771, diff_diff/wooldridge.py:L273-L275, tests/test_guides.py:L62-L64

  • Severity: P2
    Impact: The new tests only cover “constant dose in every row” and “changing nonzero dose,” so they miss the exact contradictory cases introduced by the new guidance: 0,0,d,d within-unit paths and zero rows without any never-treated units. That leaves the P1 above unguarded.
    Concrete fix: Add regression tests asserting consistency between profile_panel guidance and ContinuousDiD.fit() for:

    1. balanced panel with 0,0,d,d dose paths;
    2. panel with row-level zeros but no never-treated units.
      Locations: tests/test_profile_panel.py:L1028-L1105
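The second contradictory case can be made concrete with a toy long-format panel (made-up rows, illustrative only): zero-dose rows can exist even when no unit is never-treated, so the row-level and unit-level facts are distinct.

```python
# Toy long-format rows: every unit is eventually treated, but each has a
# pre-period zero-dose row. (Made-up data for illustration only.)
rows = [  # (unit, time, dose)
    ("u1", 1, 0.0), ("u1", 2, 2.0),
    ("u2", 1, 0.0), ("u2", 2, 3.0),
]

has_zero_dose = any(d == 0 for _, _, d in rows)  # row-level fact
never_treated_units = {u for u, _, _ in rows} - {u for u, _, d in rows if d != 0}
has_never_treated = bool(never_treated_units)    # unit-level fact

print(has_zero_dose, has_never_treated)  # True False -- the facts diverge
```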

Path to Approval

  1. Make the new TreatmentDoseShape contract consistent with ContinuousDiD.fit() and the Methodology Registry, or explicitly scope it as descriptive-only and restore the existing prerequisites (has_never_treated, full-path time invariance) in the guide.
  2. Add regression tests for the two contradictory continuous-dose scenarios: 0,0,d,d paths and row-level zeros without never-treated controls.
  3. Fix the autonomous guide and its stability test to use WooldridgeDiD(method="poisson").

igerber added a commit that referenced this pull request Apr 24, 2026
…ptive-only; fix WooldridgeDiD method kwarg

P1 (TreatmentDoseShape vs ContinuousDiD contract):
- Reviewer correctly flagged that the new `is_time_invariant` field
  (per-unit non-zero distinct-count) does NOT match the actual
  `ContinuousDiD.fit()` gate at `continuous_did.py:222-228`, which
  uses `df.groupby(unit)[dose].nunique() > 1` over the FULL dose
  column (including pre-treatment zeros). My nonzero-only check
  silently classified `0,0,d,d` paths as time-invariant while
  ContinuousDiD would reject them.
- Removed `is_time_invariant` field from `TreatmentDoseShape`
  entirely. The pre-existing `PanelProfile.treatment_varies_within_unit`
  field already encodes the correct ContinuousDiD prerequisite (matches
  the estimator's nunique check at line 224) and is correctly documented
  in §2 of the autonomous guide. Adding a second, narrower, mismatched
  gate was confusing - the reviewer's "scope as descriptive-only" path
  is the cleaner fix.
- Reframed `TreatmentDoseShape` docstring + autonomous guide §2
  field reference: explicitly NOT a ContinuousDiD prerequisite.
  `n_distinct_doses`, `has_zero_dose`, `dose_min/max/mean` provide
  descriptive distributional context; `has_never_treated` (unit-level)
  + `treatment_varies_within_unit == False` (full-path constancy) +
  `is_balanced` are the authoritative gates.
- Rewrote §5.2 worked example reasoning chain to use the existing
  correct gates and added a counter-example showing
  `has_zero_dose=True` does NOT imply `has_never_treated=True` (the
  row-level vs unit-level distinction).
- Added `test_treatment_dose_does_not_gate_continuous_did` covering
  the two contradictory cases the reviewer named: (1) `0,0,d,d`
  within-unit dose path, asserting `treatment_varies_within_unit=True`
  (the actual ContinuousDiD gate fires correctly); (2) row-level zeros
  without never-treated units, asserting `has_zero_dose=True` BUT
  `has_never_treated=False` (the two facts are distinct).
- Removed `test_treatment_dose_continuous_time_varying_within_unit`
  and `test_treatment_dose_distinguishes_doses_at_high_precision` -
  both tested the dropped `is_time_invariant` field.
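The mismatch described above can be reproduced with a toy 0,0,d,d dose path in plain Python (a stand-in for the library's pandas nunique check, not the actual implementation):

```python
# A unit whose dose path is 0,0,d,d: zero in pre-periods, constant d after.
dose_path = [0.0, 0.0, 2.5, 2.5]

# The dropped nonzero-only check: distinct count over non-zero doses.
nonzero_distinct = len({d for d in dose_path if d != 0})

# The ContinuousDiD-style check: distinct count over the FULL dose column
# (equivalent in spirit to groupby(unit)[dose].nunique() > 1).
full_distinct = len(set(dose_path))

print(nonzero_distinct)  # 1 -> nonzero-only check calls this "time-invariant"
print(full_distinct)     # 2 -> full-path check flags within-unit variation
```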

P2 (WooldridgeDiD constructor kwarg):
- The autonomous guide §5.3 worked example used
  `WooldridgeDiD(family="poisson")` but the actual constructor at
  `wooldridge.py:264` takes `method=`. Following the example would
  raise `TypeError: __init__() got an unexpected keyword argument
  'family'`. Fixed in two places (the prose and the code snippet)
  and added a negative assertion in `test_guides.py` to prevent
  regression: `assert 'WooldridgeDiD(family="poisson")' not in text`.
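The failure mode is an ordinary keyword-argument mismatch; a stand-in class (not the real WooldridgeDiD) reproduces it:

```python
# Stand-in for the estimator constructor; the real class lives in
# diff_diff/wooldridge.py and takes `method=`, not `family=`.
class WooldridgeDiD:
    def __init__(self, method="ols"):
        self.method = method

est = WooldridgeDiD(method="poisson")  # the documented spelling
try:
    WooldridgeDiD(family="poisson")    # the kwarg the guide wrongly used
except TypeError as exc:
    print(type(exc).__name__)          # TypeError
```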

CHANGELOG updated to reflect the revised TreatmentDoseShape scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: cc91a7dad85dbb36ec713b2deedeae7d0938e3bc


Overall Assessment
Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • The prior re-review P1 is resolved: the new TreatmentDoseShape surface is now explicitly descriptive-only, and the guide/tests defer to the same ContinuousDiD gates the estimator and Methodology Registry enforce: has_never_treated, treatment_varies_within_unit, and is_balanced. diff_diff/profile.py:L64-L83, diff_diff/profile.py:L658-L697, diff_diff/continuous_did.py:L222-L360, docs/methodology/REGISTRY.md:L717-L745, diff_diff/guides/llms-autonomous.txt:L198-L220, diff_diff/guides/llms-autonomous.txt:L499-L507, diff_diff/guides/llms-autonomous.txt:L732-L757, tests/test_profile_panel.py:L1057-L1115
  • The prior re-review P2 is resolved: the worked example now uses WooldridgeDiD(method="poisson"), matching the public API and the Wooldridge documentation. diff_diff/guides/llms-autonomous.txt:L794-L805, diff_diff/wooldridge.py:L271-L276, docs/methodology/REGISTRY.md:L1333-L1349, tests/test_guides.py:L50-L71
  • I did not find estimator-math, weighting, variance/SE, identification-assumption, or default-behavior changes in the PR; the changed surface is descriptive profiling, exports, guide content, and tests. diff_diff/profile.py:L43-L200, diff_diff/__init__.py:L253-L259, diff_diff/guides/llms-autonomous.txt:L637-L811
  • P3: ROADMAP.md still overstates the new fields as if they themselves gate Wooldridge/ContinuousDiD routing, which conflicts with the corrected guide/docstrings. ROADMAP.md:L141-L141
  • P3: the new top-level exports lack a matching regression in the existing top-level import-surface test. diff_diff/__init__.py:L253-L259, diff_diff/__init__.py:L503-L508, tests/test_profile_panel.py:L404-L413
  • Static source review only: I could not execute tests here because this environment is missing pytest and even numpy.

Methodology

  • No findings. The previous ContinuousDiD contract mismatch is fixed and now matches both the estimator implementation and REGISTRY.md. diff_diff/profile.py:L64-L83, diff_diff/continuous_did.py:L222-L360, docs/methodology/REGISTRY.md:L717-L745
  • No findings. The previous Wooldridge guide/API mismatch is fixed. diff_diff/guides/llms-autonomous.txt:L794-L805, diff_diff/wooldridge.py:L271-L276

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • P3 — ROADMAP.md still says the new extensions expose facts that “gate WooldridgeDiD QMLE / ContinuousDiD prerequisites pre-fit,” which reintroduces the exact conceptual confusion the guide/docstrings just corrected. Reword it to say the new fields add descriptive outcome/dose-shape context, while the authoritative ContinuousDiD gates remain has_never_treated, treatment_varies_within_unit, and is_balanced, and outcome_shape.is_count_like only informs Poisson-QMLE consideration. ROADMAP.md:L141-L141; compare diff_diff/profile.py:L64-L83, diff_diff/guides/llms-autonomous.txt:L198-L220, diff_diff/guides/llms-autonomous.txt:L499-L507

Tech Debt

  • No findings. I do not see a new correctness issue that should be deferred into TODO.md.

Security

  • No findings.

Documentation/Tests

  • P3 — The PR adds OutcomeShape and TreatmentDoseShape to the public top-level namespace, but test_top_level_import_surface() still only covers profile_panel, PanelProfile, and Alert; extend that test to assert diff_diff.OutcomeShape, diff_diff.TreatmentDoseShape, and their presence in diff_diff.__all__. diff_diff/__init__.py:L253-L259, diff_diff/__init__.py:L503-L508, tests/test_profile_panel.py:L404-L413
  • Review note: I could not run the added tests in this environment because pytest is unavailable and importing the package runtime also fails for lack of numpy.
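The requested import-surface regression follows a standard pattern; the sketch below demonstrates it against the stdlib json module, since diff_diff itself is not importable here:

```python
# Generic import-surface regression pattern (stand-in target: the stdlib
# json module; the real test would target diff_diff and its __all__).
import json

def check_import_surface(module, names):
    # Every public name must be importable AND advertised in __all__.
    for name in names:
        assert hasattr(module, name), f"missing export: {name}"
        assert name in module.__all__, f"not in __all__: {name}"
    return True

print(check_import_surface(json, ["dumps", "loads", "JSONDecoder"]))
```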

igerber added a commit that referenced this pull request Apr 24, 2026
…corrected scope; cover new exports in import-surface test

P3 #1 (ROADMAP wording drift):
ROADMAP.md still said the new fields "gate WooldridgeDiD QMLE /
ContinuousDiD prerequisites pre-fit" and mentioned "time-invariance",
which contradicted the round-1 corrections to TreatmentDoseShape's
docstring + autonomous guide §2 + §5.2. Reworded to match: the new
fields add descriptive distributional context only;
`outcome_shape.is_count_like` informs (not gates) the WooldridgeDiD
QMLE judgment, and the authoritative ContinuousDiD pre-fit gates
remain `has_never_treated`, `treatment_varies_within_unit`, and
`is_balanced`. "Time-invariance" wording removed (the field was
dropped in round 1).

P3 #2 (import-surface test coverage):
`test_top_level_import_surface()` previously only verified
`profile_panel`, `PanelProfile`, `Alert`. Extended to also cover the
two new public exports `OutcomeShape` and `TreatmentDoseShape`,
asserting both their importability and their presence in
`diff_diff.__all__`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 370915efd4056efc0cfe882a96041818a5d62239


Overall assessment

Looks good — no unmitigated P0/P1 findings. The prior substantive re-review issues I checked are resolved.

Executive summary

  • The earlier ContinuousDiD contract concern is fixed: the new treatment_dose surface is now explicitly descriptive-only, and the guide points back to the same field-based gates enforced in ContinuousDiD.fit() and documented in the registry. diff_diff/profile.py:L64-L89, diff_diff/guides/llms-autonomous.txt:L198-L220, diff_diff/guides/llms-autonomous.txt:L499-L507, diff_diff/continuous_did.py:L222-L360, docs/methodology/REGISTRY.md:L717-L745
  • The earlier Wooldridge worked-example/API mismatch is fixed: the guide now uses WooldridgeDiD(method="poisson"), matching the public API and the Wooldridge registry entry. diff_diff/guides/llms-autonomous.txt:L794-L805, diff_diff/wooldridge.py:L271-L275, docs/methodology/REGISTRY.md:L1333-L1349
  • The earlier top-level export/test gap is fixed: OutcomeShape and TreatmentDoseShape are exported from diff_diff and covered by the import-surface regression. diff_diff/__init__.py:L253-L259, diff_diff/__init__.py:L503-L510, tests/test_profile_panel.py:L404-L422
  • P3: the new “authoritative ContinuousDiD pre-fit gates” wording now sounds exhaustive but omits the separate duplicate-cell hard stop already documented elsewhere; _precompute_structures() still resolves duplicate (unit, time) rows by last-row-wins overwrite. diff_diff/guides/llms-autonomous.txt:L330-L343, diff_diff/guides/llms-autonomous.txt:L499-L507, diff_diff/continuous_did.py:L818-L823, diff_diff/profile.py:L69-L75, CHANGELOG.md:L11-L11, ROADMAP.md:L141-L141
  • Static review only: I could not run tests here because this environment is missing pytest, numpy, and pandas.

Methodology

  • No P0/P1 findings. The changed guide/docstring text now matches the current ContinuousDiD and WooldridgeDiD contracts and does not introduce an undocumented estimator-methodology deviation.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings beyond the documentation item below.

Tech Debt

  • No findings. I did not see a related mitigation entry needed in TODO.md.

Security

  • No findings.

Documentation/Tests

  • P3 — Impact: the new ContinuousDiD prerequisite summary omits the already-documented duplicate-row hard gate, so a reader treating the new treatment_dose summary as exhaustive can still route duplicate-containing panels into a code path that silently overwrites earlier (unit, time) rows. Concrete fix: in the new TreatmentDoseShape docs and the related guide/ROADMAP/CHANGELOG wording, change “authoritative gates are ...” to “core field-based gates are ... plus no duplicate_unit_time_rows alert,” and add a regression in tests/test_guides.py that the ContinuousDiD prerequisite summary still mentions duplicates. diff_diff/guides/llms-autonomous.txt:L330-L343, diff_diff/guides/llms-autonomous.txt:L499-L507, diff_diff/continuous_did.py:L818-L823, diff_diff/profile.py:L69-L75, CHANGELOG.md:L11-L11, ROADMAP.md:L141-L141
  • Verification note: I could not execute the new tests in this environment.
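The silent-overwrite behaviour flagged in the P3 can be illustrated with a toy cell map built row by row (an illustrative stand-in for the matrix fill, not the library code):

```python
# Building a {(unit, time): outcome} cell map row by row; a duplicate
# (unit, time) key silently overwrites the earlier row -- last row wins.
rows = [
    ("u1", 1, 10.0),
    ("u1", 1, 99.0),   # duplicate (unit, time) cell
    ("u1", 2, 12.0),
]

outcome = {}
for unit, time, y in rows:
    outcome[(unit, time)] = y  # no error, no warning: earlier value is lost

print(outcome[("u1", 1)])  # 99.0
```

This is why the absence of a duplicate_unit_time_rows alert is a real prerequisite and not just a style preference: nothing downstream can detect that the 10.0 row ever existed.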

igerber added a commit that referenced this pull request Apr 25, 2026
…ontinuousDiD prerequisite summaries

Reviewer correctly noted that the round-2 wording lists
`has_never_treated` + `treatment_varies_within_unit == False` +
`is_balanced` as the "authoritative" ContinuousDiD pre-fit gates but
omits the duplicate-cell hard stop. Verified
`continuous_did.py:_precompute_structures` (line 818-823) builds
`outcome_matrix` cell-by-cell with last-row-wins on duplicate
`(unit, time)` keys - so absence of the `duplicate_unit_time_rows`
alert is also a real prerequisite, not just a style preference.

Updated wording in five places to add "+ absence of the
`duplicate_unit_time_rows` alert" alongside the other gates and
explain the silent-overwrite behavior:

- `diff_diff/profile.py` `TreatmentDoseShape` docstring
- `diff_diff/guides/llms-autonomous.txt` §2 field reference
- `diff_diff/guides/llms-autonomous.txt` §4.7 (continuous design feature)
- `diff_diff/guides/llms-autonomous.txt` §5.2 worked example reasoning
  chain (now lists four gates instead of three)
- `CHANGELOG.md` Unreleased entry
- `ROADMAP.md` AI-Agent Track building-block

Also softened "authoritative" -> "core field-based" since the
non-field-based duplicate-row gate makes the original phrasing
slightly misleading.

Added a test_guides.py regression asserting the autonomous guide
mentions `duplicate_unit_time_rows` so future wording changes can't
silently drop the gate from the summary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 22ff5dc161534d087d0aa191b74f04efd99eaa43


Overall Assessment

⚠️ Needs changes — I found two unmitigated P1 methodology issues in the new count-outcome guidance/routing.

Executive Summary

  • The prior duplicate-row documentation issue appears resolved: the ContinuousDiD prerequisite summary now keeps duplicate_unit_time_rows in scope.
  • P1: The new Wooldridge Poisson guidance says the estimator returns a multiplicative/proportional effect, but the implementation and reporting contract expose an ASF-based ATT difference on the outcome scale.
  • P1: The new OutcomeShape.is_count_like heuristic does not exclude negative support, so it can steer negative integer outcomes toward WooldridgeDiD(method="poisson"), which hard-rejects negative outcomes.
  • P2: The added tests pin section presence and API spelling, but they do not guard the nonlinear Wooldridge target-parameter wording or the negative-count edge case.
  • Static review only: I could not run the test suite because pytest is not installed in this environment.

Methodology

  • Severity: P1. Affected method: WooldridgeDiD(method="poisson"). Impact: The new guide tells agents that Poisson ETWFE “estimates the multiplicative effect directly” and that they should “report the multiplicative effect (proportional change),” but the library’s own methodology contract is ASF-based ATT, and the Poisson path computes E[exp(η_1)] - E[exp(η_0)], i.e. an outcome-scale difference, not a ratio. An agent following the new worked example can therefore misreport overall_att on the wrong scale. Concrete fix: Rewrite the §4.11 / §5.3 wording so Poisson ETWFE is described as an ASF-based ATT on the natural outcome scale; if you want to mention proportional effects, label them as a separate derived interpretation, not the estimator’s overall_att target. Refs: diff_diff/guides/llms-autonomous.txt:L621-L629, diff_diff/guides/llms-autonomous.txt:L801-L811, docs/methodology/REGISTRY.md:L1335-L1357, diff_diff/wooldridge.py:L1191-L1225, diff_diff/_reporting_helpers.py:L262-L281.
  • Severity: P1. Affected surface: profile_panel outcome-shape routing for WooldridgeDiD(method="poisson"). Impact: is_count_like currently requires integer-valued data, zeros, positive skew, and >2 distinct values, but it does not require non-negative support. That means a right-skewed integer outcome with zeros and some negative values can still set is_count_like=True, after which the new guide treats it as a Poisson candidate even though the Poisson fitter raises on any negative outcome. This is a missing edge-case guard on the new routing signal. Concrete fix: Add a non-negativity requirement to is_count_like (value_min >= 0 or equivalent), document it, and add a regression test with negative integer outcomes plus zeros expecting is_count_like=False. Refs: diff_diff/profile.py:L621-L645, diff_diff/guides/llms-autonomous.txt:L621-L629, diff_diff/wooldridge.py:L1105-L1108.
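The first P1's distinction is easy to see numerically. With made-up linear indices (all values here are hypothetical), the ASF-style scalar is a difference of averaged exp() terms on the outcome scale, and a multiplicative reading is a different quantity:

```python
# Toy numeric illustration of an ASF-style contrast E[exp(eta_1)] - E[exp(eta_0)].
# The eta values are invented for the illustration.
from math import exp

eta_1 = [1.2, 0.9, 1.5]  # hypothetical treated linear indices per cell
eta_0 = [1.0, 0.8, 1.1]  # hypothetical counterfactual untreated indices

mu_1 = sum(exp(e) for e in eta_1) / len(eta_1)
mu_0 = sum(exp(e) for e in eta_0) / len(eta_0)

att = mu_1 - mu_0    # outcome-scale difference: the reported scalar
ratio = mu_1 / mu_0  # a multiplicative reading: a different quantity

print(att != ratio - 1.0)  # True -- the two summaries are not interchangeable
```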

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. I did not see these P1 items mitigated in TODO.md, and they are not deferrable under the review policy.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: The new tests verify the worked-examples section exists and that the Poisson example uses method= rather than family=, but they do not verify the nonlinear Wooldridge estimand wording or the negative-outcome exclusion. Both P1 issues above can therefore pass the current suite. Concrete fix: Add guide-content assertions that §4.11/§5.3 do not describe overall_att as “multiplicative” / “proportional change,” and add a profile_panel regression test for a right-skewed integer outcome with zeros and some negative values. Refs: tests/test_guides.py:L62-L83, tests/test_profile_panel.py:L887-L907.

Path to Approval

  1. Fix the Poisson ETWFE wording in diff_diff/guides/llms-autonomous.txt so it matches the ASF-based ATT contract already documented in REGISTRY.md and _reporting_helpers.py.
  2. Tighten OutcomeShape.is_count_like to require non-negative support before it can steer users toward Poisson.
  3. Add regression tests for both the estimand wording and the negative-integer edge case.

igerber added a commit that referenced this pull request Apr 25, 2026
…n estimand wording + is_count_like non-negativity guard

P1 #1 (Wooldridge Poisson estimand wording):
The guide §4.11 and §5.3 worked example described
`WooldridgeDiD(method="poisson")`'s `overall_att` as a
"multiplicative effect" / "log-link effect" / "proportional change"
to be reported. Verified against `wooldridge.py:1225`
(`att = _avg(mu_1 - mu_0, cell_mask)`) and
`_reporting_helpers.py:262-281` (registered estimand: "ASF-based
average from Wooldridge ETWFE ... average-structural-function (ASF)
contrast between treated and counterfactual untreated outcomes ...
on the natural outcome scale"): the actual quantity is
`E[exp(η_1)] - E[exp(η_0)]`, an outcome-scale DIFFERENCE, not a
multiplicative ratio. An agent following the previous wording would
misreport the headline scalar.

Rewrote both surfaces to:
- Describe the estimand as an ASF-based outcome-scale difference,
  citing `wooldridge.py:1225` and Wooldridge (2023) +
  REGISTRY.md §WooldridgeDiD nonlinear / ASF path.
- Explicitly note the headline `overall_att` is a difference on the
  natural outcome scale, NOT a multiplicative ratio.
- Mention that a proportional / percent-change interpretation can
  be derived post-hoc as `overall_att / E[Y_0]` but is not the
  estimator's reported scalar.

Added `test_autonomous_count_outcome_uses_asf_outcome_scale_estimand`
in `tests/test_guides.py`: extracts §4.11 and §5.3 blocks, asserts
forbidden phrases ("multiplicative effect under qmle", "estimates
the multiplicative effect", "multiplicative (log-link) effect",
"report the multiplicative effect", "report the multiplicative")
do NOT appear, and asserts §5.3 explicitly contains "ASF" and
"outcome scale" so future edits cannot silently weaken the
description.
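A stripped-down version of that forbidden-phrase regression looks like this (guide text inlined as a stand-in; the real test reads llms-autonomous.txt):

```python
# Stand-in guide excerpt; the real test extracts the section 5.3 block
# from diff_diff/guides/llms-autonomous.txt.
guide_text = """
5.3 Count-shaped outcome
The headline overall_att is an ASF-based difference on the outcome scale.
"""

forbidden = ["multiplicative effect", "proportional change"]
required = ["ASF", "outcome scale"]

ok = (all(p not in guide_text for p in forbidden)
      and all(r in guide_text for r in required))
print(ok)  # True
```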

P1 #2 (`is_count_like` non-negativity guard):
The `is_count_like` heuristic gated on integer-valued + has-zeros +
right-skewed + > 2 distinct values, but did NOT exclude negative
support. Verified against `wooldridge.py:1105-1109`: Poisson method
hard-rejects `y < 0` with `ValueError`. Without a value_min >= 0
guard, a right-skewed integer outcome with zeros and some negatives
would set `is_count_like=True` and steer an agent toward an
estimator that then refuses to fit.

Added `value_min >= 0.0` to the heuristic and explained the
non-negativity gate in the docstring + autonomous guide §2 field
reference (now reads
"is_integer_valued AND pct_zeros > 0 AND skewness > 0.5 AND
n_distinct_values > 2 AND value_min >= 0"). The guide also notes
that the gate exists specifically to align the routing signal with
WooldridgeDiD Poisson's hard non-negativity requirement.

Added `test_outcome_shape_count_like_excludes_negative_support` in
`tests/test_profile_panel.py` covering a Poisson-distributed outcome
with a small share of negative integers spliced in: asserts
`is_count_like=False` despite the other four conditions firing.
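Written out as a standalone toy function (not the diff_diff implementation), the five-condition heuristic described above reads:

```python
# Toy version of the five-condition count-like heuristic described in this
# commit; skewness is passed in precomputed to keep the sketch stdlib-only.
def is_count_like(values, skewness):
    return (
        all(float(v).is_integer() for v in values)  # integer-valued
        and any(v == 0 for v in values)             # pct_zeros > 0
        and skewness > 0.5                          # right-skewed
        and len(set(values)) > 2                    # > 2 distinct values
        and min(values) >= 0                        # non-negativity guard
    )

counts = [0, 1, 1, 2, 5, 0, 3]
with_negatives = [0, 1, -2, 2, 5, 0, 3]
print(is_count_like(counts, skewness=1.2))          # True
print(is_count_like(with_negatives, skewness=1.2))  # False: negative support
```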

P2 (test coverage for both P1s):
Both regressions above guard the new contracts. The guide test
guards the wording surface; the profile test guards the heuristic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b444300244662048617cd21fc6357d5cae0b1e32


Overall Assessment

⚠️ Needs changes — the two prior Wooldridge-related P1s look resolved in the provided diff, but there is one P1 [Newly identified] in the new ContinuousDiD pre-fit routing/docs.

Executive Summary

  • The prior Poisson issues appear fixed in the provided diff: WooldridgeDiD(method="poisson") is now described as ASF/outcome-scale, and outcome_shape.is_count_like now excludes negative outcomes.
  • P1 [Newly identified]: the new treatment_dose / ContinuousDiD prerequisite summaries omit the estimator’s strictly-positive-treated-dose restriction. A panel can satisfy every documented gate and still hard-fail in ContinuousDiD.fit() on negative dose support.
  • The added tests cover the two prior Poisson regressions, but there is still no regression around negative-dose ContinuousDiD routing.
  • No findings in code quality, performance, security, or tracked-tech-debt handling.
  • Static review only: pytest is not installed in this environment, so I could not run the suite.

Methodology

Re-review note: the previous Poisson estimand-wording and negative-count-routing findings appear resolved in the provided diff.

  • Severity: P1 [Newly identified]. Impact: the PR now presents has_never_treated, treatment_varies_within_unit == False, is_balanced, and no duplicate_unit_time_rows alert as the authoritative pre-fit gates for ContinuousDiD, and §5.2 says that if those pass the estimator is “in scope.” That omits the implementation’s hard support restriction that treated doses must be strictly positive; ContinuousDiD.fit() raises on negative treated dose support, and the methodology note is written on D > 0 support. An autonomous agent can therefore route a balanced, never-treated, time-invariant panel with treatment_dose.dose_min < 0 into a deterministic fit-time error. Concrete fix: extend the new routing contract so ContinuousDiD is only in scope when nonzero doses are strictly positive as well; at minimum, document treatment_dose.dose_min > 0 as a hard gate, and preferably expose an explicit negative-dose alert/boolean. Refs: diff_diff/profile.py:L65-L85, diff_diff/guides/llms-autonomous.txt:L737-L751, diff_diff/continuous_did.py:L287-L294, docs/methodology/continuous-did.md:L180-L184.
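The extra gate proposed above amounts to a one-line check over the nonzero dose support; a hypothetical helper (illustrative, not library code):

```python
# Hypothetical routing helper: ContinuousDiD-style designs need nonzero
# doses to be strictly positive, not merely zero-supported.
def nonzero_doses_strictly_positive(doses):
    nonzero = [d for d in doses if d != 0]
    return bool(nonzero) and min(nonzero) > 0

print(nonzero_doses_strictly_positive([0.0, 0.0, 1.5, 2.0]))   # True
print(nonzero_doses_strictly_positive([0.0, -0.5, 1.5, 2.0]))  # False
```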

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. I did not find the ContinuousDiD gate omission documented as a registry deviation or tracked in TODO.md, so it is unmitigated.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests now guard the prior Poisson regressions, but they still do not cover the expanded ContinuousDiD routing surface for negative-dose panels. The treatment-dose tests only cover zero-baseline/no-zero/binary/categorical cases, and the guide tests do not assert any negative-dose exclusion. Concrete fix: add a regression with a balanced, never-treated, time-invariant continuous panel containing a negative nonzero dose and assert the routing surface rejects ContinuousDiD (via explicit alert/boolean or guide wording keyed to dose_min < 0). Refs: tests/test_profile_panel.py:L1037-L1187, tests/test_guides.py:L62-L105.

Path to Approval

  1. Update the new TreatmentDoseShape / autonomous-guide prerequisite summaries so ContinuousDiD is only “in scope” when nonzero doses are strictly positive in addition to being zero-supported, time-invariant, balanced, and duplicate-free.
  2. Add a targeted regression for a negative-dose continuous panel that otherwise passes the existing documented gates, so this support restriction is covered going forward.

igerber force-pushed the agent-profile-shape-extensions branch from b444300 to ed57c7c on April 25, 2026 at 00:41
igerber added a commit that referenced this pull request Apr 25, 2026
…ptive-only; fix WooldridgeDiD method kwarg

P1 (TreatmentDoseShape vs ContinuousDiD contract):
- Reviewer correctly flagged that the new `is_time_invariant` field
  (per-unit non-zero distinct-count) does NOT match the actual
  `ContinuousDiD.fit()` gate at `continuous_did.py:222-228`, which
  uses `df.groupby(unit)[dose].nunique() > 1` over the FULL dose
  column (including pre-treatment zeros). My nonzero-only check
  silently classified `0,0,d,d` paths as time-invariant while
  ContinuousDiD would reject them.
- Removed `is_time_invariant` field from `TreatmentDoseShape`
  entirely. The pre-existing `PanelProfile.treatment_varies_within_unit`
  field already encodes the correct ContinuousDiD prerequisite (matches
  the estimator's nunique check at line 224) and is correctly documented
  in §2 of the autonomous guide. Adding a second, narrower, mismatched
  gate was confusing - the reviewer's "scope as descriptive-only" path
  is the cleaner fix.
- Reframed `TreatmentDoseShape` docstring + autonomous guide §2
  field reference: explicitly NOT a ContinuousDiD prerequisite.
  `n_distinct_doses`, `has_zero_dose`, `dose_min/max/mean` provide
  descriptive distributional context; `has_never_treated` (unit-level)
  + `treatment_varies_within_unit == False` (full-path constancy) +
  `is_balanced` are the authoritative gates.
- Rewrote §5.2 worked example reasoning chain to use the existing
  correct gates and added a counter-example showing
  `has_zero_dose=True` does NOT imply `has_never_treated=True` (the
  row-level vs unit-level distinction).
- Added `test_treatment_dose_does_not_gate_continuous_did` covering
  the two contradictory cases the reviewer named: (1) `0,0,d,d`
  within-unit dose path, asserting `treatment_varies_within_unit=True`
  (the actual ContinuousDiD gate fires correctly); (2) row-level zeros
  without never-treated units, asserting `has_zero_dose=True` BUT
  `has_never_treated=False` (the two facts are distinct).
- Removed `test_treatment_dose_continuous_time_varying_within_unit`
  and `test_treatment_dose_distinguishes_doses_at_high_precision` -
  both tested the dropped `is_time_invariant` field.
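The two contrasts at the heart of this P1 (full-path vs. nonzero-only distinct counts, and row-level zeros vs. unit-level never-treated) can be sketched in a few lines of plain Python. This is a hypothetical stand-in for the pandas `groupby(unit)[dose].nunique()` logic, not the library's code; names are illustrative.

```python
# Illustrative sketch of the two time-invariance checks the commit
# contrasts. Hypothetical helpers; the real library uses pandas.

def varies_full_path(dose_path):
    """ContinuousDiD-style gate: distinct count over the FULL dose path,
    including pre-treatment zeros (mirrors nunique() > 1)."""
    return len(set(dose_path)) > 1

def varies_nonzero_only(dose_path):
    """The dropped is_time_invariant heuristic: distinct count over
    non-zero doses only."""
    return len({d for d in dose_path if d != 0}) > 1

path = [0, 0, 2.5, 2.5]  # zero pre-treatment, constant dose afterwards

# Full-path check fires (the actual ContinuousDiD gate rejects this unit)...
assert varies_full_path(path) is True
# ...but the nonzero-only check silently calls it time-invariant.
assert varies_nonzero_only(path) is False

# Separately: row-level zeros do not imply a never-treated unit.
units = {"a": [0, 0, 2.5, 2.5], "b": [1.0, 1.0, 1.0, 1.0]}
has_zero_dose = any(d == 0 for p in units.values() for d in p)
has_never_treated = any(all(d == 0 for d in p) for p in units.values())
assert has_zero_dose and not has_never_treated
```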

P2 (WooldridgeDiD constructor kwarg):
- The autonomous guide §5.3 worked example used
  `WooldridgeDiD(family="poisson")` but the actual constructor at
  `wooldridge.py:264` takes `method=`. Following the example would
  raise `TypeError: __init__() got an unexpected keyword argument
  'family'`. Fixed in two places (the prose and the code snippet)
  and added a negative assertion in `test_guides.py` to prevent
  regression: `assert 'WooldridgeDiD(family="poisson")' not in text`.
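The negative assertion in `test_guides.py` takes roughly this shape; the snippet below runs it against an in-memory stand-in string rather than the real `llms-autonomous.txt`, so the `guide_text` value is illustrative only.

```python
# Hypothetical stand-in for the guide content; the real test loads
# diff_diff/guides/llms-autonomous.txt from the installed package.
guide_text = 'Fit with WooldridgeDiD(method="poisson") for count outcomes.'

# Regression guard: the broken kwarg spelling must never reappear...
assert 'WooldridgeDiD(family="poisson")' not in guide_text
# ...and the corrected spelling should be present.
assert 'WooldridgeDiD(method="poisson")' in guide_text
```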

CHANGELOG updated to reflect the revised TreatmentDoseShape scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…corrected scope; cover new exports in import-surface test

P3 #1 (ROADMAP wording drift):
ROADMAP.md still said the new fields "gate WooldridgeDiD QMLE /
ContinuousDiD prerequisites pre-fit" and mentioned "time-invariance",
which contradicted the round-1 corrections to TreatmentDoseShape's
docstring + autonomous guide §2 + §5.2. Reworded to match: the new
fields add descriptive distributional context only;
`outcome_shape.is_count_like` informs (not gates) the WooldridgeDiD
QMLE judgment, and the authoritative ContinuousDiD pre-fit gates
remain `has_never_treated`, `treatment_varies_within_unit`, and
`is_balanced`. "Time-invariance" wording removed (the field was
dropped in round 1).

P3 #2 (import-surface test coverage):
`test_top_level_import_surface()` previously only verified
`profile_panel`, `PanelProfile`, `Alert`. Extended to also cover the
two new public exports `OutcomeShape` and `TreatmentDoseShape`,
asserting both their importability and their presence in
`diff_diff.__all__`.
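The extended import-surface assertions look roughly like this. Since `diff_diff` cannot be imported here, the sketch builds a throwaway module object with `types.ModuleType` to show the shape of the checks; everything about the fake module is an assumption.

```python
import types

# Hypothetical stand-in for the diff_diff package; the real test
# simply does `import diff_diff` and asserts against it.
diff_diff = types.ModuleType("diff_diff")
diff_diff.OutcomeShape = type("OutcomeShape", (), {})
diff_diff.TreatmentDoseShape = type("TreatmentDoseShape", (), {})
diff_diff.__all__ = ["profile_panel", "PanelProfile", "Alert",
                     "OutcomeShape", "TreatmentDoseShape"]

# Both new exports must be importable attributes AND listed in __all__.
for name in ("OutcomeShape", "TreatmentDoseShape"):
    assert hasattr(diff_diff, name)
    assert name in diff_diff.__all__
```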

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…ontinuousDiD prerequisite summaries

Reviewer correctly noted that the round-2 wording lists
`has_never_treated` + `treatment_varies_within_unit == False` +
`is_balanced` as the "authoritative" ContinuousDiD pre-fit gates but
omits the duplicate-cell hard stop. Verified
`continuous_did.py:_precompute_structures` (line 818-823) builds
`outcome_matrix` cell-by-cell with last-row-wins on duplicate
`(unit, time)` keys - so absence of the `duplicate_unit_time_rows`
alert is also a real prerequisite, not just a style preference.

Updated wording in five places to add "+ absence of the
`duplicate_unit_time_rows` alert" alongside the other gates and
explain the silent-overwrite behavior:

- `diff_diff/profile.py` `TreatmentDoseShape` docstring
- `diff_diff/guides/llms-autonomous.txt` §2 field reference
- `diff_diff/guides/llms-autonomous.txt` §4.7 (continuous design feature)
- `diff_diff/guides/llms-autonomous.txt` §5.2 worked example reasoning
  chain (now lists four gates instead of three)
- `CHANGELOG.md` Unreleased entry
- `ROADMAP.md` AI-Agent Track building-block

Also softened "authoritative" -> "core field-based" since the
non-field-based duplicate-row gate makes the original phrasing
slightly misleading.

Added a test_guides.py regression asserting the autonomous guide
mentions `duplicate_unit_time_rows` so future wording changes can't
silently drop the gate from the summary.
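The silent last-row-wins behavior that makes the duplicate-row absence a real prerequisite can be sketched minimally: building a cell matrix keyed by `(unit, time)` lets a later duplicate row overwrite an earlier one with no error. This is an illustration of the failure mode, not the estimator's actual `_precompute_structures` code.

```python
# Duplicate (unit, time) cells overwrite silently when the outcome
# matrix is built cell-by-cell. Toy data; illustrative only.
rows = [
    ("u1", 1, 10.0),
    ("u1", 2, 12.0),
    ("u1", 2, 99.0),  # duplicate (unit, time) key
]

outcome_matrix = {}
for unit, time, y in rows:
    outcome_matrix[(unit, time)] = y  # later rows win, no error raised

assert outcome_matrix[("u1", 2)] == 99.0  # the first value was lost
assert len(outcome_matrix) == 2           # three rows collapsed to two cells
```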

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…n estimand wording + is_count_like non-negativity guard

P1 #1 (Wooldridge Poisson estimand wording):
The guide §4.11 and §5.3 worked example described
`WooldridgeDiD(method="poisson")`'s `overall_att` as a
"multiplicative effect" / "log-link effect" / "proportional change"
to be reported. Verified against `wooldridge.py:1225`
(`att = _avg(mu_1 - mu_0, cell_mask)`) and
`_reporting_helpers.py:262-281` (registered estimand: "ASF-based
average from Wooldridge ETWFE ... average-structural-function (ASF)
contrast between treated and counterfactual untreated outcomes ...
on the natural outcome scale"): the actual quantity is
`E[exp(η_1)] - E[exp(η_0)]`, an outcome-scale DIFFERENCE, not a
multiplicative ratio. An agent following the previous wording would
misreport the headline scalar.

Rewrote both surfaces to:
- Describe the estimand as an ASF-based outcome-scale difference,
  citing `wooldridge.py:1225` and Wooldridge (2023) +
  REGISTRY.md §WooldridgeDiD nonlinear / ASF path.
- Explicitly note the headline `overall_att` is a difference on the
  natural outcome scale, NOT a multiplicative ratio.
- Mention that a proportional / percent-change interpretation can
  be derived post-hoc as `overall_att / E[Y_0]` but is not the
  estimator's reported scalar.

Added `test_autonomous_count_outcome_uses_asf_outcome_scale_estimand`
in `tests/test_guides.py`: extracts §4.11 and §5.3 blocks, asserts
forbidden phrases ("multiplicative effect under qmle", "estimates
the multiplicative effect", "multiplicative (log-link) effect",
"report the multiplicative effect", "report the multiplicative")
do NOT appear, and asserts §5.3 explicitly contains "ASF" and
"outcome scale" so future edits cannot silently weaken the
description.
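The outcome-scale-difference vs. multiplicative-ratio distinction can be made concrete with toy numbers. The sketch below assumes made-up log-scale linear predictors and mirrors the quoted `E[exp(η_1)] - E[exp(η_0)]` form; it is not the library's implementation.

```python
import math

# Toy log-scale linear predictors for treated (eta_1) and counterfactual
# untreated (eta_0) cells. Values are invented for illustration.
eta_1 = [1.2, 0.9, 1.5]
eta_0 = [1.0, 0.7, 1.1]

mu_1 = [math.exp(e) for e in eta_1]
mu_0 = [math.exp(e) for e in eta_0]

# ASF-based ATT: an outcome-scale DIFFERENCE, E[exp(eta_1)] - E[exp(eta_0)].
overall_att = sum(m1 - m0 for m1, m0 in zip(mu_1, mu_0)) / len(mu_1)
assert 0.8 < overall_att < 0.9  # a difference in expected counts

# A proportional interpretation can be derived post-hoc as
# overall_att / E[Y_0], but it is a separate quantity, not the
# estimator's reported scalar.
pct_change = overall_att / (sum(mu_0) / len(mu_0))
assert 0 < pct_change < overall_att
```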

P1 #2 (`is_count_like` non-negativity guard):
The `is_count_like` heuristic gated on integer-valued + has-zeros +
right-skewed + > 2 distinct values, but did NOT exclude negative
support. Verified against `wooldridge.py:1105-1109`: Poisson method
hard-rejects `y < 0` with `ValueError`. Without a value_min >= 0
guard, a right-skewed integer outcome with zeros and some negatives
would set `is_count_like=True` and steer an agent toward an
estimator that then refuses to fit.

Added `value_min >= 0.0` to the heuristic and explained the
non-negativity gate in the docstring + autonomous guide §2 field
reference (now reads
"is_integer_valued AND pct_zeros > 0 AND skewness > 0.5 AND
n_distinct_values > 2 AND value_min >= 0"). The guide also notes
that the gate exists specifically to align the routing signal with
WooldridgeDiD Poisson's hard non-negativity requirement.

Added `test_outcome_shape_count_like_excludes_negative_support` in
`tests/test_profile_panel.py` covering a Poisson-distributed outcome
with a small share of negative integers spliced in: asserts
`is_count_like=False` despite the other four conditions firing.
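The five-condition heuristic (including the new non-negativity guard) can be sketched with stdlib Python. Function name and thresholds mirror the commit message's description; this is not the library's real implementation, and the skewness formula here is the plain population moment ratio.

```python
def is_count_like(values):
    """Sketch of the five-condition heuristic described above:
    integer-valued AND pct_zeros > 0 AND skewness > 0.5
    AND > 2 distinct values AND value_min >= 0. Hypothetical code."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        return False
    skew = sum((v - mean) ** 3 for v in values) / n / var ** 1.5
    return (
        all(float(v).is_integer() for v in values)  # integer-valued
        and any(v == 0 for v in values)             # pct_zeros > 0
        and skew > 0.5                              # right-skewed
        and len(set(values)) > 2                    # > 2 distinct values
        and min(values) >= 0.0                      # non-negative support
    )

poissonish = [0, 0, 0, 1, 1, 2, 2, 3, 5, 9]
assert is_count_like(poissonish) is True

# Splicing in a negative integer flips the heuristic off, aligning the
# routing signal with the Poisson fitter's hard non-negativity check.
assert is_count_like(poissonish + [-1]) is False
```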

P2 (test coverage for both P1s):
Both regressions above guard the new contracts. The guide test
guards the wording surface; the profile test guards the heuristic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…e-dose gate to ContinuousDiD prerequisite summaries

P1 (newly identified — `dose_min > 0` ContinuousDiD gate omitted):
Reviewer correctly noted that the round-3 prerequisite summary
(`has_never_treated`, `treatment_varies_within_unit == False`,
`is_balanced`, no `duplicate_unit_time_rows` alert) omits the
estimator's strictly-positive-treated-dose restriction. Verified
`continuous_did.py:287-294` raises `ValueError` ("Dose must be
strictly positive for treated units (D > 0)") on negative treated
dose support — a panel can satisfy every documented gate above and
still hard-fail at fit time when `treatment_dose.dose_min < 0`.

Updated wording in five surfaces to add `treatment_dose.dose_min > 0`
as the fifth pre-fit gate:

- `diff_diff/profile.py` `TreatmentDoseShape` docstring (now spells
  out all five gates with numbered list referencing the
  `continuous_did.py:287-294` line range).
- `diff_diff/guides/llms-autonomous.txt` §2 field reference (notes
  `dose_min > 0` is itself a gate; the other treatment_dose
  sub-fields remain descriptive).
- `diff_diff/guides/llms-autonomous.txt` §4.7 (continuous design
  feature paragraph).
- `diff_diff/guides/llms-autonomous.txt` §5.2 worked example
  reasoning chain (now five gates checked; added a new
  counter-example covering the negative-dose path so an agent
  reading the example sees the contradictory case explicitly).
- `CHANGELOG.md` Unreleased entry.
- `ROADMAP.md` AI-Agent Track building-block.

P2 (regression test for negative-dose path):
Added `test_treatment_dose_min_flags_negative_dose_continuous_panels`
in `tests/test_profile_panel.py` covering a balanced, never-treated,
time-invariant continuous panel where every dose level is negative.
The test asserts that all four other gates pass cleanly and that
`dose.dose_min < 0` is correctly observed — the fixture an agent
would see when reasoning about whether ContinuousDiD applies.

Added `dose_min > 0` content-stability assertion in
`tests/test_guides.py` so future wording changes can't silently drop
the gate from the autonomous-guide summary.
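The negative-dose failure mode can be sketched as follows: a panel that satisfies the other checks still hard-fails the strictly-positive-treated-dose restriction. The helper below is a hypothetical stand-in for the fit-time check (the real one lives at `continuous_did.py:287-294`), reusing the quoted error message.

```python
def check_treated_doses(doses, first_treat):
    """Hypothetical stand-in for ContinuousDiD's treated-dose check."""
    treated = [d for d, ft in zip(doses, first_treat) if ft > 0]
    if any(d <= 0 for d in treated):
        raise ValueError(
            "Dose must be strictly positive for treated units (D > 0)")

# Balanced, never-treated controls, time-invariant doses -- but the
# treated units carry a negative dose level. Toy data.
doses       = [-1.5, -1.5, 0.0, 0.0]
first_treat = [3, 3, 0, 0]

dose_min = min(d for d in doses if d != 0)
assert dose_min < 0  # the profile-side fact an agent should inspect

try:
    check_treated_doses(doses, first_treat)
    raised = False
except ValueError:
    raised = True
assert raised  # the fit-time gate fires despite the other checks passing
```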

Rebased onto origin/main; resolved CHANGELOG conflicts (kept HAD
Phase 4.5 B + dCDH by_path Wave 2 + my Wave 2 entries side by side).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: ed57c7cffa81eec1685695f3ad11894729e41372


Overall Assessment

⚠️ Needs changes — the prior negative-dose / Wooldridge Poisson wording issues look resolved, but there is one unmitigated P1 in the new ContinuousDiD routing contract.

Executive Summary

  • The earlier dose_min > 0 omission appears fixed: the guide now reflects ContinuousDiD’s negative-dose rejection, and the Wooldridge Poisson text now matches the ASF outcome-scale estimand.
  • P1 [Newly identified]: the PR repeatedly calls a five-item PanelProfile checklist the full ContinuousDiD pre-fit gate set, but ContinuousDiD.fit() still has separate first_treat validity/consistency gates and still requires P(D=0) > 0 regardless of control_group.
  • The new OutcomeShape / is_count_like heuristic is aligned with the Poisson fitter’s non-negativity requirement and does not look methodologically off.
  • Static review only: pytest is not installed in this environment, so I could not execute the added tests.

Methodology

  • Severity: P1 [Newly identified]. Impact: the new docs present has_never_treated, treatment_varies_within_unit == False, is_balanced, no duplicate_unit_time_rows, and treatment_dose.dose_min > 0 as the full ContinuousDiD gate set. That is not the estimator’s full contract. ContinuousDiD.fit() also validates the separate first_treat column, rejects NaN/negative values, drops first_treat > 0 units with dose == 0, zeroes first_treat == 0 rows with nonzero dose, and enforces P(D=0) > 0 even under control_group="not_yet_treated". An agent can therefore mark a panel as “in scope” from PanelProfile alone even though the actual first_treat input will deterministically warn/error or change the sample. Concrete fix: stop calling the five profile-derived facts the “full” gate set; describe them as the profile-side screening surface and add an explicit caveat that ContinuousDiD.fit() still requires a valid/consistent first_treat column. Refs: diff_diff/guides/llms-autonomous.txt:L203-L235, diff_diff/guides/llms-autonomous.txt:L514-L526, diff_diff/guides/llms-autonomous.txt:L758-L788, diff_diff/profile.py:L65-L99, diff_diff/continuous_did.py:L230-L360, docs/methodology/REGISTRY.md:L719-L745, docs/methodology/continuous-did.md:L182-L184, ROADMAP.md:L141, CHANGELOG.md:L12.

Code Quality

No findings.

Performance

No findings.

Maintainability

  • Severity: P3. Impact: the new internal commentary is inconsistent. Public docs now say treatment_dose.dose_min > 0 is part of the profile-side ContinuousDiD screen, but _compute_treatment_dose() and test_treatment_dose_does_not_gate_continuous_did() still say treatment-dose fields “do not gate” ContinuousDiD. That inconsistency will make future regressions more likely. Concrete fix: align the helper/test docstrings with the public contract (“most fields descriptive; dose_min > 0 is the extra screen”). Refs: diff_diff/profile.py:L65-L99, diff_diff/profile.py:L689-L694, tests/test_profile_panel.py:L1110-L1128.

Tech Debt

No findings. The P1 above is not mitigated by TODO.md tracking or a registry deviation note.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests pin the negative-dose wording and ASF outcome-scale wording, but they do not protect the missing first_treat caveat that causes the P1 above. Concrete fix: add a guide regression asserting that the ContinuousDiD section describes the five PanelProfile facts as screening checks only and explicitly mentions separate first_treat validation/consistency. Refs: tests/test_guides.py:L37-L69, diff_diff/guides/llms-autonomous.txt:L758-L788, diff_diff/continuous_did.py:L230-L360.

Path to Approval

  1. Replace every “full ContinuousDiD pre-fit gate set” claim in CHANGELOG.md, ROADMAP.md, diff_diff/profile.py, and diff_diff/guides/llms-autonomous.txt with “profile-side screening checks,” and document the separate first_treat contract from ContinuousDiD.fit().
  2. In llms-autonomous.txt §5.2, rewrite the has_never_treated == False counterexample so it states that the current implementation requires P(D=0) > 0 under both control_group options, not just the default.
  3. Add a regression in tests/test_guides.py that requires the ContinuousDiD section to mention separate first_treat validation/consistency; optionally pair it with a small tests/test_continuous_did.py scenario where the five profile-side checks pass but inconsistent first_treat still warns/errors.

igerber added a commit that referenced this pull request Apr 25, 2026
…nuousDiD prerequisite list as profile-side screening + add first_treat caveat

P1 (the five profile-derived facts are not the "full" gate set):
Reviewer correctly noted that calling
`{has_never_treated, treatment_varies_within_unit==False,
is_balanced, no duplicate_unit_time_rows alert, dose_min > 0}` the
"full ContinuousDiD pre-fit gate set" overreaches. `profile_panel`
only sees the four columns it accepts and CANNOT see the separate
`first_treat` column that `ContinuousDiD.fit()` consumes. Verified
against `continuous_did.py:230-360`: `fit()` additionally rejects
NaN/inf/negative `first_treat`, drops units with `first_treat > 0`
AND `dose == 0`, and force-zeroes `first_treat == 0` rows whose
`dose != 0` with a `UserWarning`. A panel that passes all five
profile-side checks can still surface warnings, drop rows, or raise
at fit time depending on the `first_treat` column the caller
supplies.

Reframed the wording in five surfaces from "full gate set" to
"profile-side screening checks" with an explicit caveat that the
checks are necessary-but-not-sufficient and that `ContinuousDiD.fit()`
applies separate `first_treat` validation:

- `diff_diff/profile.py` `TreatmentDoseShape` docstring (now spells
  out the screening framing explicitly + lists the `first_treat`
  validations that fit() applies).
- `diff_diff/profile.py` `_compute_treatment_dose` helper docstring
  (aligned with public contract: most fields descriptive,
  `dose_min > 0` is one of the screening checks).
- `diff_diff/guides/llms-autonomous.txt` §2 field reference (rewrote
  the multi-paragraph block to describe screening + first_treat
  caveat).
- `diff_diff/guides/llms-autonomous.txt` §4.7 (continuous design
  feature paragraph: screening checks + necessary-not-sufficient
  language + pointer to §2).
- `diff_diff/guides/llms-autonomous.txt` §5.2 worked example
  reasoning chain (rewrote step 2 to call out screening +
  first_treat caveat; clarified counter-example #4 that
  `P(D=0) > 0` is required under BOTH `control_group="never_treated"`
  and `"not_yet_treated"`, not just default).
- `CHANGELOG.md` Unreleased entry.
- `ROADMAP.md` AI-Agent Track.

P2 (test coverage for the missing `first_treat` caveat):
Added a content-stability assertion in `tests/test_guides.py`:
`assert "first_treat" in text` so the autonomous guide cannot
silently drop the explicit `first_treat` validation caveat.

P3 (helper / test-name inconsistency with public contract):
Renamed `test_treatment_dose_does_not_gate_continuous_did` to
`test_treatment_dose_descriptive_fields_supplement_existing_gates`
and rewrote its docstring to match the now-honest public contract
("most fields descriptive distributional context that supplements
the existing top-level screening checks"). The test body still
asserts the same two things — `treatment_varies_within_unit` fires
True on `0,0,d,d` paths and `has_never_treated` is independent of
`has_zero_dose` — both of which remain accurate.
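The force-zero coercion path described above can be sketched with a stdlib `warnings` example. This is a hypothetical simplification of the behavior attributed to `continuous_did.py:230-360`, not the library's code: rows declared never-treated (`first_treat == 0`) but carrying a nonzero dose get their dose zeroed with a `UserWarning`.

```python
import warnings

def coerce_never_treated(rows):
    """Hypothetical sketch: force-zero nonzero doses on first_treat == 0
    rows, warning instead of raising."""
    out = []
    for unit, first_treat, dose in rows:
        if first_treat == 0 and dose != 0:
            warnings.warn(
                "never-treated unit has nonzero dose; forcing dose to 0",
                UserWarning)
            dose = 0.0
        out.append((unit, first_treat, dose))
    return out

rows = [("a", 3, 2.0), ("b", 0, 1.5)]  # unit "b" relabeled never-treated

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    coerced = coerce_never_treated(rows)

assert coerced == [("a", 3, 2.0), ("b", 0, 0.0)]  # dose silently zeroed
assert any(issubclass(w.category, UserWarning) for w in caught)
```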

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 8601b611fa2c21684b91ecb58bf4b5df314557d6


Overall Assessment

Looks good — the prior P1 is resolved, and I did not find any new unmitigated P0/P1 issues in the changed code/docs.

Executive Summary

  • Prior re-review P1 is resolved: the ContinuousDiD language now consistently frames the five PanelProfile facts as a profile-side pre-flight screen and separately documents first_treat validation/coercion/drop behavior (diff_diff/profile.py:L65-L115, diff_diff/guides/llms-autonomous.txt:L203-L250, diff_diff/guides/llms-autonomous.txt:L530-L548, diff_diff/guides/llms-autonomous.txt:L780-L812, CHANGELOG.md:L12, ROADMAP.md:L141).
  • The new OutcomeShape.is_count_like routing is aligned with Wooldridge Poisson fit behavior by excluding negative-support outcomes, and the guide now correctly describes Poisson overall_att as an ASF-based outcome-scale difference (diff_diff/profile.py:L669-L683, docs/methodology/REGISTRY.md:L1333-L1348, diff_diff/wooldridge.py:L1102-L1109, diff_diff/wooldridge.py:L1191-L1245).
  • No estimator math, weighting, variance/SE, identification assumptions, or defaults changed in the implementation; this PR is limited to descriptive profiling fields, exports, guide content, and tests.
  • One minor documentation drift remains: the changelog summary omits the non-negativity clause from the is_count_like heuristic (CHANGELOG.md:L12).
  • Added tests cover the new profile fields and the prior review regression surface (tests/test_guides.py:L37-L185, tests/test_profile_panel.py:L889-L1313), but I could not execute them here because pytest and runtime deps such as numpy are unavailable.

Methodology

No findings. The changed docs now match the ContinuousDiD contract in docs/methodology/REGISTRY.md:L717-L745 and diff_diff/continuous_did.py:L222-L327,L348-L360, and the Wooldridge Poisson wording matches docs/methodology/REGISTRY.md:L1333-L1348 and diff_diff/wooldridge.py:L1191-L1245.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

  • Severity: P3
    Impact: CHANGELOG.md:L12 describes is_count_like without the value_min >= 0 / non-negative-support gate. Readers relying on the release notes alone may expect negative-valued outcomes to route as count-like even though the implementation intentionally suppresses that to stay compatible with WooldridgeDiD(method="poisson").
    Concrete fix: Update the changelog bullet to include the non-negativity clause, matching diff_diff/profile.py:L669-L683, diff_diff/guides/llms-autonomous.txt:L188-L198, and tests/test_profile_panel.py:L930-L971.

igerber added a commit that referenced this pull request Apr 25, 2026
…se in CHANGELOG is_count_like description

Reviewer correctly noted that the CHANGELOG bullet describing
`is_count_like` listed only four of the five conditions
(integer-valued + has zeros + right-skewed + > 2 distinct values)
but omitted the `value_min >= 0` non-negativity clause added in
round 4. Readers of the release notes alone would have expected
negative-valued count-like outcomes to route as `is_count_like=True`
even though the implementation intentionally suppresses that to stay
compatible with `WooldridgeDiD(method="poisson")`'s hard non-negative
requirement (`wooldridge.py:1105-1109`).

Updated the bullet to include the non-negativity clause and
explicitly cite the wooldridge.py line range so the release-notes
description matches `diff_diff/profile.py`,
`diff_diff/guides/llms-autonomous.txt`, and the
`test_outcome_shape_count_like_excludes_negative_support`
regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: bc13abb50011b5a5495d6448d458cff7dfa157e4


Overall Assessment

✅ Looks good

Executive Summary

Methodology

No findings. Cross-checking the changed profile/guide language against docs/methodology/REGISTRY.md:L717-L745 and docs/methodology/REGISTRY.md:L1333-L1349 found no undocumented deviation in estimator behavior.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

  • Severity: P3
    Impact: §4.7 contains two different summaries of the ContinuousDiD pre-fit contract. The opening bullet says there are only three eligibility prerequisites, while the later paragraph correctly adds the duplicate-row hard stop and the negative-dose screen. A reader who stops at the first summary can miss two load-bearing gates and route a duplicate-containing or negative-dose panel into a failed or silent-overwrite fit path (diff_diff/guides/llms-autonomous.txt:L507-L548, diff_diff/continuous_did.py:L222-L338, diff_diff/continuous_did.py:L800-L823).
    Concrete fix: Update the opening ContinuousDiD bullet in §4.7 to enumerate all five profile-side checks, or relabel the first three as “core estimator prerequisites” and explicitly cross-reference the fuller five-check screen immediately below.

Static review only: the added coverage in tests/test_profile_panel.py:L887-L1313 and tests/test_guides.py:L37-L182 looks appropriate, but I could not run it in this environment.

igerber added a commit that referenced this pull request Apr 25, 2026
…D opening bullet with the five-check screen below it

§4.7 had two summaries of the ContinuousDiD pre-fit contract that
disagreed: the opening bullet (Wave 1 era) said "Three eligibility
prerequisites", while the paragraph immediately below correctly
listed all five profile-side screening checks (Wave 2 added the
duplicate-row hard stop and the negative-dose `dose_min > 0`
screen). A reader who stopped at the opening bullet could miss
those two load-bearing gates and route a duplicate-containing or
negative-dose panel into a failed/silent-overwrite fit path.

Updated the opening bullet to enumerate all five profile-side
checks: (a) zero-dose controls / `P(D=0) > 0`, (b) per-unit
time-invariant dose, (c) balanced panel, (d) no
`duplicate_unit_time_rows` alert, (e) strictly positive treated
doses. Also clarified that the duplicate-row case (d) coerces
silently rather than raising — distinct from (a)/(b)/(c)/(e) which
all raise ValueError. Pointer to the paragraph below + §2 for the
`first_treat` validation surface kept intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a8880b46a47e39fb3c89a60718cc9b80a7bc2994


Overall Assessment

✅ Looks good

Executive Summary

  • Prior re-review issue is resolved: the autonomous guide now consistently names all five ContinuousDiD profile-side checks and the separate first_treat validation surface.
  • No estimator implementation, weighting, variance/SE computation, identification assumption, or default behavior changed in this PR; the code changes are confined to descriptive PanelProfile metadata, exports, guide prose, and tests.
  • One methodology-facing docs issue remains: the new count-outcome guidance justifies Poisson QMLE by saying linear DiD asymptotic SEs assume “normal-shaped errors,” which does not match the library’s documented cluster-robust OLS inference contract.
  • One minor docs clarity issue remains: the new ContinuousDiD worked example is written as if doses follow a zero-pre/positive-post adoption path even though the estimator, the profile gate, and the same section later reject 0,0,d,d dose paths.
  • Static review only: I could not run the added tests here because this environment does not have pytest, numpy, or pandas.

Methodology

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

  • Severity: P3. Impact: The new ContinuousDiD worked example is titled “zero baseline” and describes “block-style adoption in year 3,” but the shown profile has treatment_varies_within_unit=False, and the same section later correctly says a 0,0,d,d path is out of scope. That mixed framing can confuse an agent about the estimator’s full-path dose-invariance contract. Location: diff_diff/guides/llms-autonomous.txt:L759-L781, diff_diff/guides/llms-autonomous.txt:L822-L828, diff_diff/continuous_did.py:L222-L228. Concrete fix: rename the example to something like “continuous-dose panel with zero-dose controls,” and state explicitly that positive-dose units keep the same dose in every observed period while first_treat carries the timing information separately.

No additional test-coverage findings in the diff. Testing note: I could not execute tests/test_profile_panel.py or tests/test_guides.py in this environment because the required Python packages are not installed.

igerber and others added 3 commits April 25, 2026 06:45
…ationale in functional-form terms; rename §5.2 + clarify dose constancy

P2 (linear-OLS SE rationale wording):
The §4.11 and §5.3 prose justified `WooldridgeDiD(method="poisson")`
over linear-OLS DiD by claiming linear-OLS asymptotic SEs "assume
normal-shaped errors" that a right-skewed count distribution
violates. That misrepresents the library's documented inference
contract: linear ETWFE uses cluster-robust sandwich SEs (`OLS path`
in REGISTRY.md) which are valid for any error distribution with
finite moments — not just Gaussian.

Rewrote both surfaces in functional-form / efficiency terms:
- Linear-OLS DiD imposes an additive functional form on a
  non-negative count outcome. SEs are calibrated (cluster-robust)
  but the linear model can be inefficient and may produce
  counterfactual predictions outside the non-negative support.
- `WooldridgeDiD(method="poisson")` imposes a multiplicative
  (log-link) functional form that respects non-negativity and
  matches the typical generative process for count data; QMLE
  sandwich SEs are robust to distributional misspecification.
- The choice is about WHICH functional form best summarizes the
  treatment effect (additive vs multiplicative), not about
  whether SEs are calibrated.

§5.3 reasoning chain step 2 + step 3 rewritten to reflect this:
both estimators are now described as having calibrated inference;
the trade-off is parameterization (linear ATT vs. ASF outcome-scale
ATT). The Poisson non-negativity gate explanation is preserved.
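The functional-form point can be illustrated numerically: on a non-negative count outcome, an additive counterfactual can leave the support, while a multiplicative (log-link) one cannot. All numbers below are invented for illustration; nothing here is the library's output.

```python
import math

baseline = 0.4       # low count-outcome mean for some cell (made up)
additive_att = -1.0  # a linear model's additive effect (made up)

# Additive functional form: the counterfactual can go negative.
linear_counterfactual = baseline + additive_att
assert linear_counterfactual < 0  # outside the non-negative support

# Multiplicative (log-link) form: exp(.) keeps predictions positive.
log_link_effect = -1.0  # same magnitude, but on the log scale
poisson_counterfactual = baseline * math.exp(log_link_effect)
assert poisson_counterfactual > 0
```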

P3 (§5.2 example self-contradiction):
§5.2 was titled "Continuous-dose panel with zero baseline" with
prose "block-style adoption in year 3", which suggested a
`0,0,d,d` within-unit dose path — but the shown profile has
`treatment_varies_within_unit=False` (per-unit constant dose) and
the same section's later counter-example correctly says `0,0,d,d`
is out of scope for ContinuousDiD. Self-contradictory framing.

Renamed to "Continuous-dose panel with zero-dose controls" (per
reviewer suggestion) and clarified the prose: each dose group holds
its assigned dose value (including 0 for the never-treated
controls) in every observed period; adoption timing is carried
separately via the `first_treat` column passed to
`ContinuousDiD.fit()`, not via within-unit dose variation. Updated
the matching `test_autonomous_contains_worked_examples_section`
assertion in `tests/test_guides.py` to track the new title.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tcome rationale to functional-form framing

Round 9 rewrote §4.11 detailed bullet and §5.3 reasoning chain to
describe the Poisson-vs-linear choice in functional-form / efficiency
terms, but two surfaces still carried the old SE-based framing:

- §2 `is_count_like` field reference: "questionable asymptotic SEs"
- §4.11 preamble paragraph: "asymptotic SEs may mislead"

Both phrases conflict with the library's documented inference
contract (REGISTRY.md §WooldridgeDiD: linear ETWFE uses
cluster-robust sandwich SEs which are valid for any error
distribution with finite moments — not just Gaussian). Reviewer
correctly flagged this as the remaining methodology P2.

Updated both surfaces to match the round-9 framing:
- Linear-OLS DiD imposes an additive functional form on a
  non-negative count outcome; cluster-robust SEs remain
  calibrated but the linear model can be inefficient and may
  produce counterfactual predictions outside the non-negative
  support.
- `WooldridgeDiD(method="poisson")` is the multiplicative
  (log-link) ETWFE alternative that respects the non-negative
  support and matches the typical count-process generative model,
  with QMLE sandwich SEs robust to distributional misspecification.
- The choice is functional-form / efficiency / support, NOT SE
  calibration.

Verified no remaining "asymptotic SEs" / "questionable" / "mislead"
phrases in the guide via grep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…light checks as standard-workflow predictions, not estimator gates

Reviewer correctly noted that calling
{has_never_treated, treatment_varies_within_unit==False,
is_balanced, no duplicate_unit_time_rows alert, dose_min > 0}
the "screening checks" / "necessary" gates of `ContinuousDiD`
overstates the contract. `ContinuousDiD.fit()` keys off the
separate `first_treat` column (which `profile_panel` does not see),
defines never-treated controls as `first_treat == 0` rows,
force-zeroes nonzero `dose` on those rows with a `UserWarning`,
and rejects negative dose only among treated units `first_treat > 0`
(see `continuous_did.py:276-327` and `:348-360`).

Two of the five checks (`has_never_treated`, `dose_min > 0`) are
first_treat-dependent: agents who relabel positive- or negative-dose
units as `first_treat == 0` trigger the force-zero coercion path
with a `UserWarning` and may still fit panels that fail those
preflights, shifting the methodology in the process. The other
three (`treatment_varies_within_unit`, `is_balanced`, duplicate-row
absence) are real fit-time gates that hold regardless of how
`first_treat` is constructed.
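
The coercion path described above follows roughly this pattern; the
function, column names, and warning text here are illustrative
assumptions, not the library's actual `fit()` code:

```python
import warnings

import pandas as pd

def coerce_never_treated_dose(df: pd.DataFrame) -> pd.DataFrame:
    """Force dose to zero on first_treat == 0 rows, warning if any change."""
    df = df.copy()
    mask = (df["first_treat"] == 0) & (df["dose"] != 0)
    if mask.any():
        warnings.warn(
            f"Zeroing nonzero dose on {int(mask.sum())} never-treated "
            "(first_treat == 0) rows.",
            UserWarning,
        )
        df.loc[mask, "dose"] = 0.0
    return df

panel = pd.DataFrame({
    "first_treat": [0, 0, 3],
    "dose": [-1.0, 0.0, 2.0],   # negative dose on a never-treated row
})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cleaned = coerce_never_treated_dose(panel)
```

The never-treated rows are zeroed (not rejected), which is why the
two first_treat-dependent preflights can pass fit even when they fail
at profile time.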

Reframed every wording site to call these "standard-workflow
preflight checks" — predictive when the agent derives `first_treat`
from the same dose column passed to `profile_panel`, but not the
estimator's literal contract:

- `diff_diff/profile.py` `TreatmentDoseShape` docstring (rewrote
  the multi-paragraph block; explicit standard-workflow definition
  + per-check first_treat dependency map + force-zero coercion
  caveat).
- `diff_diff/profile.py` `_compute_treatment_dose` helper docstring
  (already brief; stays consistent).
- `diff_diff/guides/llms-autonomous.txt` §2 field reference (long
  rewrite covering the standard-workflow framing + override paths).
- `diff_diff/guides/llms-autonomous.txt` §4.7 opening bullet +
  trailing paragraph (both updated; opening bullet now spells out
  which of the five checks are first_treat-dependent vs. hard
  fit-time stops; trailing paragraph promotes the standard-
  workflow framing).
- `diff_diff/guides/llms-autonomous.txt` §5.2 reasoning chain step
  2 (rewrote the gate-checking paragraph; counter-example #4
  expanded to enumerate (a) supply matching first_treat and accept
  rejection, (b) deliberate relabel + coercion, (c) different
  estimator; counter-example #5 distinguishes negative-dose
  treated-unit rejection from never-treated coercion).
- `CHANGELOG.md` Wave 2 entry (matches the new framing).
- `ROADMAP.md` AI-Agent Track building block (matches).

Test coverage:
- Renamed assertion messages in
  `test_treatment_dose_descriptive_fields_supplement_existing_gates`
  and `test_treatment_dose_min_flags_negative_dose_continuous_panels`
  to remove "authoritative gate" phrasing; reframed as "standard-
  workflow preflight" assertions consistent with the corrected docs.
- Added `test_negative_dose_on_never_treated_coerces_not_rejects`
  in `tests/test_continuous_did.py::TestEdgeCases` covering the
  reviewer's specific request: never-treated rows with NEGATIVE
  nonzero dose must coerce (with `UserWarning`) rather than raise
  the treated-unit negative-dose error. Sister to the existing
  `test_nonzero_dose_on_never_treated_warns` which covers the
  positive-dose case.

Rebased onto origin/main during this round (no conflicts beyond
prior CHANGELOG resolutions; main advanced 19 commits).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the agent-profile-shape-extensions branch from 31e8875 to 25a1197 Compare April 25, 2026 10:53

igerber commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 25a1197a1d7f7f73fff10107bb5e49366c3cceb9


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Re-review focus: the earlier ContinuousDiD “authoritative gate” problem is mostly fixed. The new text now correctly treats has_never_treated and dose_min > 0 as standard-workflow, first_treat-dependent preflight checks rather than universal fit-time gates.
  • No estimator math, weighting, variance/SE, or default-fit behavior changed in this diff.
  • P1 The new agent-facing docs now go beyond the documented coercion note and present deliberate first_treat == 0 relabeling of nonzero-dose units as an agent option to “manufacture controls” / still fit ContinuousDiD. That is not supported by the Methodology Registry or the estimator’s own requirement that actual P(D=0) > 0 exist.
  • P2 Some new wording also says duplicate-row failures are “hard fit-time stops,” but ContinuousDiD still silently applies last-row-wins on duplicate (unit, time) cells.
  • Static review only: python -m pytest tests/test_profile_panel.py tests/test_guides.py tests/test_continuous_did.py -q could not be run because pytest is not installed.

Methodology

  • P1 Impact: the prior mismatch is not fully closed because the new docs treat intentional relabeling of positive-dose / nonzero-dose units to first_treat == 0 as an available fallback for ContinuousDiD (diff_diff/guides/llms-autonomous.txt:L241-L247, diff_diff/guides/llms-autonomous.txt:L893-L900, diff_diff/profile.py:L88-L97, ROADMAP.md:L141-L141, CHANGELOG.md:L13-L13). The registry and implementation only document this as inconsistent-input coercion with a warning, while still requiring actual P(D=0) > 0 because Remark 3.1 is unimplemented (docs/methodology/REGISTRY.md:L719-L745, diff_diff/continuous_did.py:L319-L360). An autonomous agent following the worked example could therefore alter the data to create synthetic controls without any documented methodological basis. Concrete fix: remove/rewrite the “relabel to first_treat == 0” fallback everywhere; keep the coercion note as implementation behavior for bad/inconsistent inputs, not as a valid routing option. When has_never_treated == False or dose_min <= 0, direct users to re-encode treatment or choose a different estimator unless REGISTRY is updated with a documented deviation.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings beyond the documentation inconsistency noted below.

Tech Debt

  • No findings. I did not see this issue tracked in TODO.md.

Security

  • No findings.

Documentation/Tests

  • P2 Impact: some new text now says the duplicate-row check is a “hard fit-time stop” (diff_diff/profile.py:L117-L122, diff_diff/guides/llms-autonomous.txt:L864-L866), but the estimator still silently overwrites duplicate (unit, time) rows in _precompute_structures() (diff_diff/continuous_did.py:L820-L823). That contradiction can make readers rely on fit-time protection that does not exist. Concrete fix: change this wording to “hard preflight veto / must deduplicate before fit because ContinuousDiD otherwise uses last-row-wins,” matching the actual implementation and the other accurate duplicate-row warnings already present in the guide.
  • Static note: tests were not executed here because pytest is unavailable in the environment.

Path to Approval

  1. Remove the new “manufacture controls / relabel to first_treat == 0” fallback from the guide, profile docstring, changelog, and roadmap, and replace it with wording that matches REGISTRY.md and ContinuousDiD.fit() exactly.
  2. Update the §5.2 continuous-dose worked example and related prose so the no-never-treated / negative-dose counterexamples route agents to re-encoding or alternative estimators, not deliberate relabeling.
  3. Add or update guide/profile regression tests so they lock in the corrected wording and do not normalize relabeling as an admissible methodology.

…s-fallback wording; correct duplicate-row "fit-time stop" claim

P1 (relabel-to-manufacture-controls misframing):
Round 11 introduced wording across the guide, profile docstring,
CHANGELOG, ROADMAP, and test docstrings that presented intentional
`first_treat == 0` relabeling of nonzero-dose units as an
"option" / "fallback" for fitting `ContinuousDiD` when the
profile-side preflights (`has_never_treated`, `dose_min > 0`)
fail. REGISTRY does not document this as a routing option, and the
estimator still requires actual `P(D=0) > 0` because Remark 3.1
lowest-dose-as-control is not yet implemented. The force-zero
coercion at `continuous_did.py:311-327` is implementation behavior
for INCONSISTENT inputs (e.g., user accidentally passes nonzero
dose on a never-treated row), not a methodology fallback.

Reworded every site to remove the relabeling-as-option framing and
replace it with the registry-documented fixes when (1) or (5)
fails: re-encode the treatment column to a non-negative scale that
contains a true never-treated group, or route to a different
estimator (`HeterogeneousAdoptionDiD` for graded-adoption panels;
linear DiD with the treatment as a continuous covariate). Every
remaining "manufacture controls" mention in the guide, profile,
and tests is now an explicit anti-recommendation ("do not relabel
... to manufacture controls"). Updated:

- `diff_diff/profile.py` `TreatmentDoseShape` docstring (item (1):
  "not an opportunity to relabel ..."; item (5): coercion is
  "implementation behavior for inconsistent inputs, not a
  methodological fallback").
- `diff_diff/guides/llms-autonomous.txt` §2 field reference (the
  When-(1)-or-(5)-fails paragraph names re-encode + alternative
  estimator only; explicit anti-relabel warning).
- `diff_diff/guides/llms-autonomous.txt` §4.7 opening bullet +
  trailing paragraph (consolidated; opening bullet drops the
  relabel-as-fallback framing; trailing paragraph trimmed to a
  pointer to §2).
- `diff_diff/guides/llms-autonomous.txt` §5.2 step 2 + counter-
  example #4 + counter-example #5 (relabel-as-option language
  removed; explicit "do not relabel" callouts; counter-example #4
  options trimmed to (a) re-encode and (b) different estimator).
- `CHANGELOG.md` (relabel-as-option clause removed; replaced with
  re-encode / different-estimator framing).
- `ROADMAP.md` (same).
- `tests/test_profile_panel.py` two test docstrings (relabel-as-
  workflow language removed).

P2 (duplicate-row "hard fit-time stop" misclaim):
Round 11 wording said "duplicate-row failures are hard fit-time
stops" — incorrect. `_precompute_structures` at
`continuous_did.py:818-823` silently overwrites with last-row-wins,
no exception raised. Reworded as "hard preflight veto: the agent
must deduplicate before fit because `ContinuousDiD` otherwise uses
last-row-wins, no fit-time exception" in profile.py docstring,
guide §4.7 opening bullet, and §5.2 step 2 (now defers to §2 for
the breakdown). The previously-correct §2 description of the
silent-coerce path is preserved.
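
A short sketch of the distinction (assumed column names; not the
library's `_precompute_structures`): the estimator-side behavior is
effectively last-row-wins, so the agent-side preflight must detect
duplicate cells itself before fit.

```python
import pandas as pd

panel = pd.DataFrame({
    "unit": [1, 1, 2],
    "time": [1, 1, 1],          # unit 1 has a duplicate (unit, time) cell
    "y":    [10.0, 99.0, 5.0],
})

# What a silent precompute effectively does: the later row overwrites
# the earlier one, with no exception raised.
last_wins = panel.drop_duplicates(["unit", "time"], keep="last")

# The preflight veto: detect duplicates before calling fit().
has_duplicates = bool(panel.duplicated(["unit", "time"]).any())
```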

Length housekeeping:
The round-11-to-round-12 expansion pushed `llms-autonomous.txt`
past `llms-full.txt` in length, breaking `test_full_is_largest`.
Trimmed ~2.7KB by consolidating the §4.7 trailing paragraph + §5.2
step 2 trailing block to point at §2's full breakdown rather than
duplicating the per-check semantics. autonomous: 65364 chars,
full: 66058 chars.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 307f424c9580d2739ab4b05313faaec4bb3b6f2e


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Re-review: the earlier user-facing “manufacture controls” guidance appears fixed in the guide/profile prose.
  • P1 The replacement ContinuousDiD preflight story is still wrong: it tells agents to derive first_treat from the same dose column that profile_panel() inspects, but the method/library contract uses a time-invariant unit dose plus a separate timing variable.
  • The contradiction is internal to the new §5.2 example itself: it first describes constant per-unit dose with separate first_treat, then bases the screening logic on inferring first_treat from that same dose column.
  • The same assumption is now duplicated in diff_diff/profile.py and new tests/test_profile_panel.py regression prose, so the mismatch will persist unless those surfaces are corrected together.
  • No estimator math, weighting, variance, or inference code changed in this diff.
  • Static review only: this environment lacks numpy/pytest, so I could not execute the targeted test files.

Methodology

  • P1 Impact: the new “standard workflow” in diff_diff/guides/llms-autonomous.txt:L219-L239, diff_diff/guides/llms-autonomous.txt:L530-L563, diff_diff/guides/llms-autonomous.txt:L801-L852, diff_diff/profile.py:L72-L130, and tests/test_profile_panel.py:L1111-L1199 says agents can derive first_treat from the same dose column that profile_panel() sees. That conflicts with the method contract: multi-period continuous DiD uses a time-invariant unit dose D_i and a separate timing group G_i/first_treat (docs/methodology/continuous-did.md:L65-L73), ContinuousDiD.fit() hard-rejects within-unit dose variation (diff_diff/continuous_did.py:L222-L228), and the library’s own generator emits constant dose with separate first_treat (diff_diff/prep_dgp.py:L970-L993). REGISTRY documents the untreated-group and balanced-panel requirements, but not any “derive timing from dose” workflow (docs/methodology/REGISTRY.md:L717-L723). An agent following the new guidance must either build a 0,0,d,d path to recover timing, which fit() rejects, or keep a valid constant-dose panel, in which case the dose column cannot identify first_treat at all. Concrete fix: remove the “derive first_treat from the same dose column” workflow from the guide/profile text and replace it with the actual contract: profile_panel() on the dose column alone can only provide descriptive dose-support checks; timing-dependent ContinuousDiD eligibility requires a separate first_treat column already supplied by the caller, or a new helper that validates dose and first_treat together.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • P2 Impact: PanelProfile is a public top-level type (diff_diff/__init__.py:L503-L506), but the PR adds two new required dataclass fields without defaults (diff_diff/profile.py:L179-L188). Any direct caller that instantiates PanelProfile(...) manually will now fail with TypeError, even though both additions are semantically optional. Concrete fix: make outcome_shape and treatment_dose backward-compatible additions, e.g. keyword-only fields with defaults, and add a regression test for direct construction.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • P3 Impact: the new docstring in tests/test_continuous_did.py:L903-L911 still says the coercion path “lets agents legally relabel negative-dose units as first_treat == 0,” which contradicts the corrected guide/profile/registry position. The behavior under test is fine; the comment is not. Concrete fix: rewrite the docstring to say coercion is observed implementation behavior for inconsistent inputs, not a sanctioned routing option.
  • Residual risk: static review only; I could not run tests/test_profile_panel.py, tests/test_guides.py, or tests/test_continuous_did.py because the environment is missing numpy/pytest.

Path to Approval

  1. Remove/rewrite the “derive first_treat from the same dose column” workflow in diff_diff/guides/llms-autonomous.txt, diff_diff/profile.py, CHANGELOG.md, and ROADMAP.md.
  2. Rewrite §5.2 and the new tests/test_profile_panel.py regression prose around the actual ContinuousDiD contract: time-invariant unit dose D_i, separate first_treat/G_i.
  3. If predictive pre-fit screening is still desired, add a helper or optional first_treat= input that validates dose and first_treat together; otherwise explicitly narrow the docs to descriptive dose-support checks only.

… first_treat from dose" framing; add PanelProfile backward-compat defaults; fix test_continuous_did docstring

P1 (canonical ContinuousDiD setup vs. derive-from-dose framing):
Round 12 introduced a "standard workflow" description across the
guide, profile docstring, CHANGELOG, ROADMAP, and test docstrings
that said agents derive `first_treat` from the same dose column
passed to `profile_panel`. Reviewer correctly noted this conflicts
with the actual ContinuousDiD contract (`continuous_did.py:222-228`,
`prep_dgp.py:970-993`, `docs/methodology/continuous-did.md:65-73`):
the canonical setup uses a **time-invariant per-unit dose** `D_i`
and a **separate `first_treat` column** the caller supplies — the
dose column has no within-unit time variation in this setup, so it
cannot encode timing. An agent following the rejected framing would
either build a `0,0,d,d` path (which `fit()` rejects) or keep a
valid constant-dose panel (in which case the dose column carries no
timing information).
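
Under the canonical setup just described, a valid input panel looks
like this sketch (column names are assumed from the discussion, and
the panel is built by hand rather than by the library's generator):

```python
import pandas as pd

# Per-unit facts: a time-invariant dose D_i and a SEPARATE timing column.
units = pd.DataFrame({
    "unit": [1, 2, 3],
    "dose": [0.0, 1.5, 2.0],   # D_i constant within unit; 0 marks never-treated
    "first_treat": [0, 2, 3],  # timing lives in its own column, not in dose
})
periods = pd.DataFrame({"time": [1, 2, 3]})
panel = units.merge(periods, how="cross")

# Dose never varies within a unit, so it cannot encode adoption timing;
# the convention ties first_treat == 0 to D_i == 0.
dose_is_time_invariant = bool((panel.groupby("unit")["dose"].nunique() == 1).all())
```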

Reworded every site to drop the derive-from-dose framing and
replace with the canonical setup. The five facts on the dose column
remain predictive of `fit()` outcomes BECAUSE the canonical
convention ties `first_treat == 0` to `D_i == 0` and treated units
carry their constant dose across all periods — so `has_never_treated`
proxies `P(D=0) > 0` and `dose_min > 0` predicts the strictly-
positive-treated-dose requirement, without any "derivation" of
`first_treat` from the dose column. Updated:

- `diff_diff/profile.py` `TreatmentDoseShape` docstring (rewrote
  the multi-paragraph block to use the canonical-setup framing
  and added an explicit "agent must validate `first_treat`
  independently" note).
- `diff_diff/guides/llms-autonomous.txt` §2 field reference.
- `diff_diff/guides/llms-autonomous.txt` §4.7 opening bullet.
- `diff_diff/guides/llms-autonomous.txt` §5.2 reasoning chain
  step 2 + counter-examples #4 and #5 (now describe the
  canonical setup rather than a derive-from-dose workflow).
- `CHANGELOG.md` Wave 2 entry.
- `ROADMAP.md` AI-Agent Track building block.
- `tests/test_profile_panel.py`
  `test_treatment_dose_min_flags_negative_dose_continuous_panels`
  docstring/comments.

P2 (PanelProfile direct-construction backward compat):
Wave 2 added `outcome_shape` and `treatment_dose` to PanelProfile
without defaults, breaking direct `PanelProfile(...)` calls that
predate Wave 2. Made both fields default to `None` (moved them to
the end of the field list; both are `Optional[...]`). Added
`test_panel_profile_direct_construction_without_wave2_fields`
asserting that direct construction without the new fields succeeds
and yields `None` defaults that serialize correctly through
`to_dict()`.
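
The backward-compatible pattern described above can be sketched with
plain dataclasses; the field names other than the two new ones are
placeholders, not PanelProfile's real schema:

```python
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass(frozen=True)
class OutcomeShape:
    is_count_like: Optional[bool] = None
    is_integer_valued: Optional[bool] = None

@dataclass(frozen=True)
class TreatmentDoseShape:
    is_time_invariant: Optional[bool] = None

@dataclass(frozen=True)
class PanelProfile:
    n_units: int
    n_periods: int
    # New fields go last and default to None, so pre-existing direct
    # construction like PanelProfile(50, 8) keeps working.
    outcome_shape: Optional[OutcomeShape] = None
    treatment_dose: Optional[TreatmentDoseShape] = None

    def to_dict(self) -> dict:
        return asdict(self)  # nested frozen dataclasses serialize recursively

profile = PanelProfile(50, 8)  # pre-Wave-2 call signature still valid
```

Defaulted fields must trail the required ones, which is why the new
fields move to the end of the field list.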

P3 (test_continuous_did.py docstring overstating sanction):
The new `test_negative_dose_on_never_treated_coerces_not_rejects`
docstring said the contract "lets agents legally relabel
negative-dose units as `first_treat == 0` to coerce them away."
Reworded as observed implementation behavior for inconsistent
inputs, NOT a sanctioned routing option — the test locks in the
coercion contract while the autonomous guide §5.2 explicitly tells
agents not to use this path methodologically.

Length invariant maintained: autonomous (65748 chars) < full
(66031 chars); `test_full_is_largest` still passes (compares
character count, not byte count, so on-disk size with UTF-8
multi-byte characters differs from the assertion target).
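
The character-vs-byte distinction in that parenthetical is easy to see
directly (generic Python, unrelated to the guide files themselves):

```python
s = "§ café"                         # contains multi-byte UTF-8 characters
chars = len(s)                       # what a character-count assertion compares
disk_bytes = len(s.encode("utf-8"))  # what on-disk file size reports
```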

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 60014f5c25cc54e67380a1d0c59cf93b62dc0f72


Overall Assessment

✅ Looks good

Executive Summary

  • Re-review: the prior P1 about telling agents to derive first_treat from the dose column is resolved on the public-facing surfaces. The updated ContinuousDiD guidance now matches the canonical contract: time-invariant unit dose D_i, separate caller-supplied first_treat, and separate fit-time validation of first_treat (diff_diff/profile.py:L65-L141, diff_diff/guides/llms-autonomous.txt:L206-L283, diff_diff/guides/llms-autonomous.txt:L810-L919, diff_diff/continuous_did.py:L222-L360).
  • The new OutcomeShape.is_count_like guidance is consistent with the actual Poisson path: WooldridgeDiD(method="poisson") rejects negative outcomes, and the guide now correctly describes the reported overall_att as an ASF-based outcome-scale difference rather than a multiplicative reported estimand (diff_diff/profile.py:L700-L725, diff_diff/wooldridge.py:L1102-L1245, docs/methodology/REGISTRY.md:L1333-L1349).
  • The public API propagation looks complete for this PR: the new dataclasses are exported from diff_diff, and PanelProfile keeps backward compatibility by defaulting the new fields (diff_diff/__init__.py:L252-L258, diff_diff/__init__.py:L503-L509, diff_diff/profile.py:L189-L193, tests/test_profile_panel.py:L1325-L1362).
  • P3: two new regression docstrings still restate the old “derive first_treat from dose” workflow. This is test-only prose, not a runtime or methodology bug, but it leaves stale language in the tree (tests/test_profile_panel.py:L1110-L1137, tests/test_profile_panel.py:L1184-L1199).
  • Static review only: I could not run the touched tests in this environment because pytest is unavailable (python -m pytest ... failed with No module named pytest).

Methodology

  • No unmitigated findings. The affected methods are ContinuousDiD and WooldridgeDiD, but only through pre-fit descriptive guidance, not estimator math or inference. The updated text aligns with the ContinuousDiD contract in docs/methodology/continuous-did.md:L65-L76, docs/methodology/continuous-did.md:L265-L269, docs/methodology/REGISTRY.md:L717-L745, and diff_diff/continuous_did.py:L222-L360, and with the Poisson ASF path documented for WooldridgeDiD in docs/methodology/REGISTRY.md:L1333-L1349 and implemented at diff_diff/wooldridge.py:L1102-L1245.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • Severity: P3
    Impact: tests/test_profile_panel.py:L1110-L1137 and tests/test_profile_panel.py:L1184-L1199 still describe a “standard ContinuousDiD workflow” where first_treat is derived from the same dose column. That contradicts the now-corrected contract in diff_diff/profile.py:L65-L141 and diff_diff/guides/llms-autonomous.txt:L206-L283, diff_diff/guides/llms-autonomous.txt:L810-L919. It will not change runtime behavior, but it can mislead future editors and make this methodology regression easier to reintroduce.
    Concrete fix: rewrite those test docstrings/comments to the canonical formulation already used elsewhere: time-invariant per-unit dose D_i, separate caller-supplied first_treat, and profile-side screening only on the dose column.

  • Residual risk: static review only; targeted pytest execution was not possible in this environment because pytest is not installed.

…derive first_treat from dose" wording from two test docstrings

Round 13 dropped the "standard workflow / derive first_treat from
dose" framing across the public-facing surfaces (profile.py
docstring, autonomous guide §2/§4.7/§5.2, CHANGELOG, ROADMAP) but
left two test docstrings carrying the rejected framing:

- `test_treatment_dose_descriptive_fields_supplement_existing_gates`
- `test_treatment_dose_min_flags_negative_dose_continuous_panels`

These are test-only prose, not runtime or methodology bugs, but
the stale language could mislead future editors and make the same
methodology regression easier to reintroduce. Reworded both
docstrings to use the canonical formulation now used elsewhere:
time-invariant per-unit dose `D_i`, separate caller-supplied
`first_treat`, profile-side screening on the dose column only,
and explicit reference to "with a `first_treat` column consistent
with the dose column on per-unit treated/untreated status" rather
than "agent derives `first_treat` from the dose column".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: aea7b1d11d1c852e30fa6b5e2e687d68edd1b9a9


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review concern about stale ContinuousDiD prose appears resolved: the touched guide/tests now consistently describe the canonical setup as time-invariant unit dose D_i plus a separate caller-supplied first_treat column.
  • Severity P1: the new ContinuousDiD guidance repeatedly treats treatment “re-encoding” as a documented or registry-backed fix when P(D=0)=0 or treated doses are negative, but the methodology docs only document that the current implementation requires P(D=0)>0 and that Remark 3.1’s lowest-dose-as-control path is not implemented.
  • No estimator math or inference code changed in this PR, and I did not find new NaN/inference anti-patterns or parameter-propagation gaps in the touched runtime code.
  • The new OutcomeShape / TreatmentDoseShape API surface is otherwise propagated cleanly: top-level exports, backward-compatible PanelProfile defaults, and to_dict() serialization all line up.
  • Static review only: I could not run the touched tests here because the available python3 environment lacks numpy and pytest.

Methodology

  • Severity: P1
    Impact: ContinuousDiD’s documented contract is still “requires P(D=0)>0,” and the paper’s no-untreated-group fallback is the lowest-dose-as-control remark, which the registry explicitly marks as not implemented (docs/methodology/REGISTRY.md:L719-L745, docs/methodology/continuous-did.md:L183-L184, docs/methodology/continuous-did.md:L265-L269). The PR’s new agent-facing text instead says the “documented fixes” / “registry-documented options” are to re-encode the treatment scale into a true never-treated group or non-negative support (diff_diff/profile.py:L82-L96, diff_diff/profile.py:L113-L121, diff_diff/guides/llms-autonomous.txt:L244-L247, diff_diff/guides/llms-autonomous.txt:L562-L565, diff_diff/guides/llms-autonomous.txt:L593-L595, diff_diff/guides/llms-autonomous.txt:L893-L911). That is a new, undocumented methodological recommendation on the library’s agent surface, and the new tests codify it (tests/test_profile_panel.py:L1138-L1142, tests/test_profile_panel.py:L1200-L1202).
    Concrete fix: Reword these passages to the supported contract: current ContinuousDiD requires untreated D=0 units and positive treated doses; lowest-dose-as-control exists in the paper but is not implemented here. If the project intentionally wants to endorse recoding as acceptable preprocessing, add an explicit Note:/deviation entry to docs/methodology/REGISTRY.md that defines when that recoding is valid and how it changes the estimand, then update the guide/docstrings/tests to cite that new note.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. I did not find an existing TODO.md entry that mitigates the methodology issue above.

Security

No findings.

Documentation/Tests

No additional findings beyond the Methodology item. Coverage on the new profile fields and guide wording is otherwise strong, and the earlier stale test-prose issue looks resolved, but I could not execute the suite in this environment because python3 lacks numpy and pytest.

Path to Approval

  1. Remove the unsupported “documented fix” / “registry-documented fix” wording for ContinuousDiD recoding across the touched surfaces (diff_diff/profile.py, diff_diff/guides/llms-autonomous.txt, CHANGELOG.md, ROADMAP.md, and the new assertions/comments in tests/test_profile_panel.py) so they match the current registry.
  2. If recoding guidance is intentional, document it first in docs/methodology/REGISTRY.md with the exact admissible cases and estimand consequences, then update the guide/docstrings/tests to reference that note instead of implying the support already exists.

…fixes" overclaim for ContinuousDiD recoding

P1 (overclaiming registry endorsement of recoding):
Reviewer correctly noted the round-13/14 wording across the
public-facing surfaces called re-encoding the treatment column a
"registry-documented fix" / "documented option" / "documented
fallback". REGISTRY only documents the `P(D=0) > 0` requirement
and explicitly notes Remark 3.1's lowest-dose-as-control fallback
is NOT implemented in this library. Re-encoding is an agent-side
preprocessing choice that the registry neither endorses nor
forbids — calling it "registry-documented" was an over-claim.

Reworded twelve sites to drop the "documented" framing:
- `diff_diff/profile.py` `TreatmentDoseShape` docstring (items
  (1) and (5)).
- `diff_diff/guides/llms-autonomous.txt` §2 field reference
  When-(1)-or-(5)-fails paragraph.
- `diff_diff/guides/llms-autonomous.txt` §4.7 opening bullet
  trailing language.
- `diff_diff/guides/llms-autonomous.txt` §4.7 trailing paragraph
  (consolidated to a pointer at §2; reduced redundancy).
- `diff_diff/guides/llms-autonomous.txt` §5.2 reasoning chain
  counter-example #4.
- `tests/test_profile_panel.py` two test docstrings + one
  inline assertion message + one trailing comment.
- `CHANGELOG.md` Wave 2 entry.
- `ROADMAP.md` AI-Agent Track building block.

The corrected framing across all surfaces:
- Honestly state the contract: `ContinuousDiD` requires
  `P(D=0) > 0` and positive treated doses; Remark 3.1 not
  implemented.
- When the contract isn't met, say `ContinuousDiD` "as currently
  implemented does not apply" — not "do this fix."
- Mention routing alternatives that ARE in the library and DON'T
  require `P(D=0) > 0`: `HeterogeneousAdoptionDiD`, linear DiD
  with a continuous covariate. Those are routing facts, not
  methodology endorsements.
- Re-encoding stays in the prose as an "agent-side preprocessing
  choice that changes the estimand and is not documented in
  REGISTRY as a supported fallback" — explicitly NOT endorsed.

Length housekeeping: trimmed redundancy in the §4.7 trailing
paragraph (consolidated to a pointer at §2) and tightened the §2
recoding paragraph; autonomous (65984 chars) < full (66031),
`test_full_is_largest` green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 31a843a37a6c229445d003d694b26a0acebd9233


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review P1 about ContinuousDiD recoding language looks resolved: the new guide/docstrings now say re-encoding changes the estimand and is not a registry-backed fallback.
  • Severity P1 [Newly identified]: the new ContinuousDiD preflight guidance leans on PanelProfile.has_never_treated, but its implementation does not match its documented contract for continuous panels.
  • diff_diff/profile.py:L356-L360 defines has_never_treated as “some unit has treatment == 0 in every observed non-NaN row,” while diff_diff/profile.py:L533-L540 sets it from groupby(unit).max() == 0 only.
  • That can silently false-positive on mixed negative/zero dose paths, and the new autonomous guide now uses that field as the proxy for P(D=0) > 0 in diff_diff/guides/llms-autonomous.txt:L115-L121 and diff_diff/guides/llms-autonomous.txt:L228-L245.
  • I did not find new estimator-math, variance/SE, NaN-inference, or parameter-propagation issues in the touched runtime code.
  • Static review only: I could not execute the touched tests locally because the available Python environment lacks pandas.

Methodology

  • Severity: P1 [Newly identified]. Impact: the PR’s new ContinuousDiD routing surface assumes has_never_treated means “zero in every observed row,” but the implementation returns True whenever any unit’s maximum observed dose is 0. On a continuous panel with a unit path like [-1, 0], the flag is wrong even though no unit is actually a zero-dose control. Because the new agent-facing docs treat this field as the P(D=0) > 0 proxy, the PR can now misstate control availability before fit. Relevant locations: diff_diff/profile.py:L356-L360, diff_diff/profile.py:L533-L540, diff_diff/guides/llms-autonomous.txt:L115-L121, diff_diff/guides/llms-autonomous.txt:L228-L245, tests/test_profile_panel.py:L461-L477. Concrete fix: compute the flag from “all observed non-NaN doses for a unit equal 0” (for example, require both per-unit min and max to be 0, or an explicit all-zero check), then add a regression covering a negative/zero mixed path and asserting has_never_treated is False.

Code Quality

  • No findings beyond the Methodology item.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No separate tech-debt findings. TODO.md does not currently track the P1 above, so it remains unmitigated.

Security

  • No findings.

Documentation/Tests

  • No additional findings beyond the Methodology item. The new guide/tests are otherwise materially better aligned with the registry than in the previous review, especially on removing the unsupported “documented fix” framing for ContinuousDiD recoding.

Path to Approval

  1. Fix diff_diff/profile.py so continuous has_never_treated only counts units whose full observed dose path is zero.
  2. Add a regression in tests/test_profile_panel.py for a continuous panel containing a mixed negative/zero unit path (for example [-1, 0]) asserting that it does not count as never-treated.
  3. Recheck the new autonomous-guide wording after that fix lands, since the PR now treats has_never_treated as the zero-dose-control proxy for ContinuousDiD.

…ementation to match documented contract on continuous panels with negative doses

Reviewer correctly identified an implementation bug between the
documented contract and the current code:

- Documented (line 356-360): "some unit has treatment == 0 in
  every observed non-NaN row"
- Implemented (line 540): `bool(np.any(unit_max == 0))`

These are equivalent on binary panels (values in {0, 1}) but DIVERGE
on continuous panels with negative dose support. A unit path
`[-1, 0]` has `unit_max == 0` but is NOT a zero-dose never-treated
control — some rows are negative.

The autonomous guide's ContinuousDiD preflight now treats
`has_never_treated` as the proxy for `P(D=0) > 0`, so a false-
positive on a `[-1, 0]` path would silently mislead an agent about
control availability. Reviewer flagged this as P1.

Fixed the implementation to enforce both endpoints:
`bool(np.any((unit_max == 0) & (unit_min == 0)))`. For any numeric
panel this is exactly "every observed dose is 0" — matches the
docstring. Updated the inline comment to explain why both
endpoints are required.

Added `test_continuous_negative_then_zero_does_not_count_as_never_treated`
covering the exact bug shape: a unit path of `[-1, -1, 0, 0]` has
`unit_max == 0` but is not a never-treated control. Test asserts
`has_never_treated is False` on the panel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
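The divergence and the fix can be reproduced in a dependency-free sketch. The dose paths below mirror the regression test; the plain-dict grouping and the names `buggy`/`fixed` are illustrative stand-ins for the pandas `groupby(unit)` aggregation and the `np.any((unit_max == 0) & (unit_min == 0))` expression in `diff_diff/profile.py`:

```python
# Illustrative stand-in for the profile.py check: unit "a" carries the
# mixed negative/zero dose path from the regression test; unit "b" is
# strictly positive. Plain dicts replace the pandas groupby for brevity.
panel = {
    "a": [-1.0, -1.0, 0.0, 0.0],  # unit_max == 0, but NOT all-zero
    "b": [0.5, 0.5, 1.0, 1.0],
}

unit_max = {u: max(doses) for u, doses in panel.items()}
unit_min = {u: min(doses) for u, doses in panel.items()}

# Old check (max-only): unit "a" false-positives as never-treated.
buggy = any(unit_max[u] == 0 for u in panel)   # True -- wrong

# Fixed check: both endpoints zero <=> every observed dose is zero.
fixed = any(unit_max[u] == 0 and unit_min[u] == 0 for u in panel)  # False -- correct
```

On binary {0, 1} panels `min <= max`, so `max == 0` already implies an all-zero path, which is why the two checks only diverge once negative doses enter the support.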
@igerber

igerber commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: e712742f3faa11ceefdff91389a4ec736513f2e3


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review P1 on has_never_treated semantics appears resolved: _classify_treatment() now requires both per-unit min and max to be 0, and a regression covers the [-1, 0] path case in diff_diff/profile.py:L533-L548 and tests/test_profile_panel.py:L480-L509.
  • I did not find new estimator-math, variance/SE, NaN-inference, or parameter-propagation defects in the touched runtime code.
  • P1: the new ContinuousDiD fallback guidance is too broad on negative-dose failures. The PR now presents HeterogeneousAdoptionDiD as an alternative when treatment_dose.dose_min <= 0, but HAD’s documented scope and validator require nonnegative doses.
  • That mismatch is duplicated across public-facing surfaces and even a new test docstring, so the PR is currently baking incorrect methodology guidance into the agent path.
  • Static review only: I could not run the touched tests here because pytest is not installed (python -m pytest -> No module named pytest).

Methodology

  • Severity: P1. Impact: affected methods are ContinuousDiD and HeterogeneousAdoptionDiD. The new profile/guide prose says that when ContinuousDiD fails because dose_min <= 0, a routing alternative is HeterogeneousAdoptionDiD. That is not consistent with the Methodology Registry or HAD’s front-door validator: HAD is defined on nonnegative dose support (D_{g,2} >= 0) and explicitly raises on negative post-period dose. A user or agent following the new guidance can be steered into an unsupported estimator path. Relevant locations: docs/methodology/REGISTRY.md:L2257-L2263, diff_diff/had.py:L1450-L1458, diff_diff/profile.py:L119-L129, diff_diff/guides/llms-autonomous.txt:L244-L249, diff_diff/guides/llms-autonomous.txt:L565-L570, tests/test_profile_panel.py:L1240-L1245, tests/test_profile_panel.py:L1279-L1281, plus the duplicated release-note surfaces in CHANGELOG.md:L13 and ROADMAP.md:L141. Concrete fix: split the failure modes. Keep HAD only for nonnegative graded-adoption panels that satisfy HAD’s own design contract. For dose_min < 0, say ContinuousDiD does not apply; either re-encode to nonnegative support with the estimand change explicitly documented, or use a method that treats dose as a signed continuous covariate. Update the duplicated prose and add a regression that forbids presenting HAD as a direct fallback in the negative-dose example.

Code Quality

  • No additional findings beyond the Methodology item.

Performance

  • No findings.

Maintainability

  • No additional findings beyond the Methodology item. The main maintainability risk is the same incorrect fallback text being duplicated across multiple files.

Tech Debt

  • No separate tech-debt findings. The P1 above is not tracked in TODO.md, so it remains unmitigated.

Security

  • No findings.

Documentation/Tests

  • No separate findings beyond the Methodology item. The new guide/tests materially improve alignment overall, but tests/test_profile_panel.py:L1240-L1245 and L1279-L1281 currently encode the same unsupported HAD fallback for negative-dose panels.
  • Static-review limitation: I could not execute the touched tests in this environment because pytest is unavailable.

Path to Approval

  1. Update diff_diff/profile.py and diff_diff/guides/llms-autonomous.txt so HAD is no longer described as a direct fallback for dose_min < 0 / negative treated-dose panels.
  2. Update the duplicated prose in CHANGELOG.md, ROADMAP.md, and the negative-dose comments/docstring in tests/test_profile_panel.py to match the HAD nonnegative-dose contract in REGISTRY.md.
  3. Add a guide regression in tests/test_guides.py asserting that the negative-dose ContinuousDiD counterexample does not route directly to HeterogeneousAdoptionDiD unless the treatment has first been re-encoded onto nonnegative support and the estimand change is called out.

…s "negative dose" branches; HAD only valid on the former

Reviewer correctly noted that the round-15/16 wording listed
`HeterogeneousAdoptionDiD` as a routing alternative whenever
`ContinuousDiD` fails on the dose-related preflights, but HAD
itself requires non-negative dose support and raises on negative
post-period dose at `had.py:1450-1459` (paper Section 2). On a
panel with `dose_min < 0`, routing to HAD silently steers an agent
into the same fit-time error. Verified the rejection at
`had.py:1450-1459`.

Reworded every site to split the two failure modes:

- Branch (a): `has_never_treated == False` (no zero-dose controls
  but all observed doses non-negative). `ContinuousDiD` does not
  apply (Remark 3.1 not implemented). HAD IS a routing alternative
  on this branch (HAD's contract requires non-negative dose,
  satisfied here); linear DiD with a continuous covariate is
  another.
- Branch (e): `dose_min < 0` (negative treated doses).
  `ContinuousDiD` does not apply AND HAD is **not** a fallback
  either — HAD raises on negative post-period dose
  (`had.py:1450-1459`). Linear DiD with a signed continuous
  covariate is the applicable alternative on this branch.
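The two-branch split above can be encoded as a small routing helper. This is a hypothetical sketch: `continuous_did_fallbacks` is not part of the library's API, and the returned strings are shorthand for the routing alternatives named in this commit, not callable estimators:

```python
def continuous_did_fallbacks(dose_min: float, has_never_treated: bool) -> list[str]:
    """Sketch of the dose-preflight routing logic; names are illustrative."""
    if dose_min < 0:
        # Branch (e): HAD raises on negative post-period dose
        # (had.py:1450-1459), so it is NOT a fallback here.
        return ["linear DiD with signed continuous covariate"]
    if not has_never_treated:
        # Branch (a): all doses non-negative but no zero-dose controls.
        # HAD's non-negative-dose contract is satisfied on this branch.
        return ["HeterogeneousAdoptionDiD",
                "linear DiD with continuous covariate"]
    return ["ContinuousDiD"]  # dose preflights pass
```

An agent following this logic never sees HAD offered on a `dose_min < 0` panel, which is exactly the property the new guide regression asserts.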

Updated wording across:
- `diff_diff/profile.py` `TreatmentDoseShape` docstring (refactored
  from item-by-item duplication into a numbered list with a single
  "Routing alternatives when (1) or (5) fails" section that splits
  the two branches; trimmed redundancy).
- `diff_diff/guides/llms-autonomous.txt` §2 field reference (split
  the When-(1)-or-(5)-fails paragraph into the two branches).
- `diff_diff/guides/llms-autonomous.txt` §4.7 trailing paragraph
  (consolidated to a pointer at §2's split discussion).
- `diff_diff/guides/llms-autonomous.txt` §5.2 reasoning chain
  counter-example #4 (no never-treated branch: HAD applies) and
  counter-example #5 (negative-dose branch: HAD does NOT apply,
  cite `had.py:1450-1459`).
- `CHANGELOG.md` Wave 2 entry.
- `ROADMAP.md` AI-Agent Track building block.
- `tests/test_profile_panel.py` two test docstrings/comments.

Added `test_autonomous_negative_dose_path_does_not_route_to_had`
in `tests/test_guides.py` asserting that §5.2 explicitly cites
`had.py:1450-1459` on the negative-dose branch (used a single-
line fingerprint since the prose phrase "non-negative dose
support" is split across newlines in the rendered guide).
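The single-line fingerprint approach can be sketched as follows (the helper name and the sample guide text are hypothetical; only the `had.py:1450-1459` citation string comes from the actual guide):

```python
def negative_dose_branch_cites_had_rejection(guide_text: str) -> bool:
    """Hypothetical fingerprint check mirroring the guide regression."""
    # The phrase "non-negative dose support" wraps across rendered lines,
    # so the regression keys on the one-line contract citation instead.
    return "had.py:1450-1459" in guide_text

sample_guide = (
    "counter-example #5 (negative-dose branch): HAD does NOT apply;\n"
    "it raises on negative post-period dose (had.py:1450-1459).\n"
)
```

Matching a literal substring that cannot be re-wrapped keeps the test stable against future prose reflows of the surrounding paragraph.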

Length housekeeping: trimmed counter-example #4 and #5 prose +
§4.7 trailing paragraph to point at §2's split discussion;
autonomous (65374 chars) < full (66031), `test_full_is_largest`
green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 25, 2026

/ai-review

The `llms-full.txt > llms-autonomous.txt` length assertion was a
stale Wave-1 invariant. Original assumption: "full" is the
comprehensive reference and therefore the longest by definition.
Reality after Wave 1 + Wave 2 + 17 review rounds: the autonomous
reference guide has grown organically (estimator-support matrix,
per-design-feature reasoning, three worked examples, contract
citations like `had.py:1450-1459`) while `llms-full.txt` is the
API-docs roll-up and hasn't been updated in lockstep with
Wave 2's agent-facing content. The two guides serve different
audiences — there is no user-facing invariant requiring one to be
larger than the other.

The invariant was forcing artificial trims: each review round
that added a contract caveat to autonomous required compensating
prose deletion elsewhere just to keep autonomous below full. Each
trim made autonomous marginally less complete without
methodological benefit.

Removed the autonomous comparison; kept `full > concise` and
`full > practitioner` since `concise` is by definition shorter and
`practitioner` is workflow prose without the API roll-up. Added a
test docstring explaining the rationale so future editors don't
reintroduce the comparison.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
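The retained assertions could look like the following sketch. The guide bodies are stand-in strings sized to the character counts quoted above, and the test name is illustrative; only the two kept comparisons come from this commit:

```python
guides = {
    "full": "f" * 66031,          # API-docs roll-up
    "autonomous": "a" * 65374,    # agent guide: no ordering invariant vs full
    "concise": "c" * 9000,        # stand-in size: definitionally shorter
    "practitioner": "p" * 30000,  # stand-in size: workflow prose, no API roll-up
}

def test_full_dominates_derived_guides():
    # Kept invariants only: full > concise and full > practitioner.
    # Deliberately NO assertion relating full and autonomous; the two
    # guides serve different audiences, so neither bounds the other.
    assert len(guides["full"]) > len(guides["concise"])
    assert len(guides["full"]) > len(guides["practitioner"])
```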
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 872abc8bc04f3e326d2269898003f32d1d70c72b


Overall Assessment

✅ Looks good

Executive Summary

  • The prior re-review P1 appears resolved: the negative-dose branch now consistently says HeterogeneousAdoptionDiD is not a fallback, across diff_diff/profile.py:L114-L138, diff_diff/guides/llms-autonomous.txt:L244-L267 and L901-L910, CHANGELOG.md:L13-L13, ROADMAP.md:L141-L141, with regressions in tests/test_guides.py:L166-L197 and tests/test_profile_panel.py:L1229-L1294.
  • The runtime has_never_treated fix now matches the documented contract for continuous panels by requiring a candidate never-treated unit to be exactly zero in every observed row, avoiding the [-1, 0] false positive in diff_diff/profile.py:L535-L550; this is covered by tests/test_profile_panel.py:L480-L509.
  • I did not find new estimator-math, identification, weighting, variance/SE, NaN-inference, or parameter-propagation defects in the touched runtime code.
  • The new count-outcome guidance is aligned with Wooldridge Poisson’s non-negative-outcome requirement and ASF outcome-scale ATT reporting in diff_diff/profile.py:L721-L745, diff_diff/guides/llms-autonomous.txt:L188-L201 and L714-L737, consistent with docs/methodology/REGISTRY.md:L1335-L1349 and diff_diff/wooldridge.py:L1102-L1109.
  • Static review only: I could not run the touched tests here because pytest is not installed, and import-time smoke checks were also blocked because numpy is unavailable in this environment.

Methodology

  • Severity: P3 (informational, resolved prior finding). Impact: the previously reported negative-dose routing mismatch is fixed; the changed surfaces now correctly distinguish the “no never-treated, nonnegative dose” branch from the “negative-dose” branch, where HAD is inapplicable. Concrete fix: none. References: diff_diff/profile.py:L114-L138, diff_diff/guides/llms-autonomous.txt:L244-L267 and L901-L910, docs/methodology/REGISTRY.md:L2257-L2263 and L2504-L2506, diff_diff/had.py:L1446-L1459, diff_diff/continuous_did.py:L287-L294.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No new untracked tech-debt findings.

Security

No findings.

Documentation/Tests

No findings in the touched docs/tests. The new guide/profile regressions cover the prior methodology issue and the continuous has_never_treated edge case in tests/test_guides.py:L37-L219 and tests/test_profile_panel.py:L480-L509 plus L919-L1294. Static limitation: I could not execute those tests in this environment because pytest is unavailable.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 25, 2026
@igerber igerber merged commit 47d3e02 into main Apr 25, 2026
23 of 24 checks passed
@igerber igerber deleted the agent-profile-shape-extensions branch April 25, 2026 14:33
igerber added a commit that referenced this pull request Apr 25, 2026
…cross all canonical surfaces

P2 - HeterogeneousAdoptionDiD shipped (Phases 1a-3 + 3-followup + 4.5 A/B/C0,
merged via PRs #366/#367) but is missing from the new authoritative
documentation surfaces this PR establishes:
- README.md ## Estimators catalog: add one-line entry after ContinuousDiD
  (de Chaisemartin, Ciccia, D'Haultfoeuille & Knau 2026; alias HAD).
- diff_diff/guides/llms.txt ## Estimators: add matching one-liner.
- .claude/commands/docs-check.md required-estimators table: add HAD row
  pointing at had.rst as the API target.
- docs/references.rst: add new "Heterogeneous Adoption (No-Untreated
  Designs)" sub-section with the de Chaisemartin et al. (2026)
  arXiv:2405.04465v6 citation.
- docs/api/index.rst: add HeterogeneousAdoptionDiD to estimators
  autosummary; add HeterogeneousAdoptionDiDResults +
  HeterogeneousAdoptionDiDEventStudyResults to results autosummary.
- docs/api/had.rst (NEW): autoclass page for the three classes with a
  brief intro and a "When to use HAD" note pointing at sibling estimators
  (ContinuousDiD for never-treated controls, dCDH for binary reversible).
- docs/doc-deps.yaml: add diff_diff/had.py + diff_diff/had_pretests.py
  source mappings (REGISTRY methodology, had.rst api_reference, README
  catalog, references.rst, llms.txt). The llms-full.txt mapping is
  intentionally omitted with a comment - that section is deferred to the
  Phase 5 follow-up tracked in TODO.md.

TODO.md: narrow the Phase 5 entry from "llms.txt updates" to "llms-full.txt
HeterogeneousAdoptionDiD section" since the catalog one-liner, API page,
and bibliography landed here. practitioner_next_steps() integration and
tutorial notebook remain deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>